Skip navigation
Joao Gameiro


Posted by Joao Gameiro Employee Dec 22, 2017



Charts are a good way to visualize our data, but in some cases they are not enough.

Imagine you need to show a vehicle current status, based on his sensor data, don't you think the image below is more suitable than a table or chart?




The intention is to expose in the easiest way the information needed to support future actions.


Despite all possibilities to do this using the available methods, the idea of using an SVG image, which could be changed according to the data being received, sounded really cool.


As you might know, SVG images are basically XML files with four interesting characteristics (at least the ones that matter for this component):

  • are scalable,
  • normally end up with file sizes shorter than similar image formats,
  • can have identifiable elements on their structure,
  • modifiable once injected into the HTML DOM, via javascript.


Merging the approach used on SVG component and VIZ-APIVIZ-SVG was born.



  • Javascript knowledge
  • Pentaho 8.0
  • npm

If you don’t have the node.js (npm) already installed follow this instruction:

How to build your own visualization using Viz-SVG project


Download the git folder example.


Extract the zip file, the necessary files are already there:

  • SVG image – circle.svg
  • The Model – yourmodel.js
  • Dummy data – data.json




  <title>SVG Example</title>
  <rect fill="#fff" id="canvas_background" height="402" width="582" y="-1" x="-1"/>
  <g display="none" overflow="visible" y="0" x="0" height="100%" width="100%" id="canvasGrid">
   <rect fill="url(#gridpattern)" stroke-width="0" y="0" x="0" height="100%" width="100%"/>
  <ellipse id="Sales" fill="lightgray" stroke="black" ry="70" rx="70"  cy="195" cx="281" stroke-width="5" />


  • svgElementID - the element name on circle.svg
  • value - bussines data


  "model": [
         {"name": "svgElementID", "type": "string", "label": "svg Element ID"},
         {"name": "value", "type": "number", "label": "Value"}
  "rows": [
         {"c": [ "Sales",  11000000]}


var SvgModel = BaseModel.extend({
     $type: {
          // SVG Label and Class
          styleClass: "pentaho-visual-samples-svg",
          label: "SVG Circle"

     getSvgPath: function() {
          return "./circle.svg"; 

     getSvgPartforDataID: function(dataId){
          return ["Stroke", "Fill"];

     toSvgId: function(dataId, prop){
          return dataId ;

     toSvgAttribute: function(dataId, prop, dataValue) {
               case "Stroke": return "stroke";
               case "Fill": return "fill";
          return prop;

     toSvgValue: function(dataId, prop, dataValue) {
               case "Stroke": return dataValue > 11000000 ? "red" : "green";
               case "Fill": return dataValue > 11000000? "green" : "red";


Pay attention to:

  • Label, this is the name that will appear in the list of available visualizations in Analyzer
label: "SVG Circle"


  • Path to your image
getSvgPath: function() {
     return "./circle.svg"; 


  • Properties that will be changed for each SVG element, (Stroke and Fill)
getSvgPartforDataID: function(dataId){
     return ["Stroke", "Fill"];


  • SVG id - if you need to change the SVG element id add some logic to this function
    • Take a look at the other example (modelProductLine.js)
toSvgId: function(dataId, prop){
     return dataId ;


  • Mapping between properties defined previous and SVG style properties
toSvgAttribute: function(dataId, prop, dataValue) {
          case "Stroke": return "stroke";
          case "Fill": return "fill";
     return prop;


  • Define your business logic
toSvgValue: function(dataId, prop, dataValue) {
          case "Stroke": return dataValue > 11000000 ? "red" : "green";
          case "Fill": return dataValue > 11000000? "green" : "red";



Now it's time to test our visualization

Register our files on index.html (Index.html purpose is to render our code on the browser.)

  • data.json is our dummy data
  • yourModel is our configuration


      ], function (Context, Table, dataSpec) {

        // Setup up a VizAPI context.
        Context.createAsync({ application: "viz-api-sandbox" })
          .then(function (context) {
            // Get the model and base view types
            return context.getDependencyAsync({
              BarModel: "pentaho/visual/samples/bar/modelProductLine",
              BaseView: "pentaho/visual/base/view"



line 04 - "json!./sales-by-product-family.json"   to    "json!./data.json"   and


line 13 - BarModel: "pentaho/visual/samples/bar/modelProductLine",  to   BarModel: "pentaho/visual/samples/bar/yourModel",


Let’s open index.html on browser.

Attention: directly opening the file through the filesystem will not work when using Google Chrome (and possibly other browsers), because of security restrictions that disallow the loading of local resources using XHR — a functionality that is required by the VIZ-SVG to load localization bundles and other resources.

To overcome this limitation we will use Node.js to serve the project files through an HTTP server.



npm install -g node-static

static -p 8001


Now you can open in the browser and test.


Change the values of data.json to modify colors

Edit data.json and change 11000000 to -> 11000001.


Refresh your browser

Colors should change (if not, clear browser cache).


Edit your model until your visualization is OK.


Create the Pentaho Web Package

Register your model on config.js and package.json

  • Config.js
    •      add the following line:\w+)$/, "yourModel")(\w+)$/, "yourModel")
select: {
          type: [
     \w+)$/, "modelGeneric"),
     \w+)$/, "modelProductLine"),
     \w+)$/, "yourModel")


  • package.json
    • add the following line: "pentaho-visual-samples-svg/yourModel": {"base": "pentaho/visual/base/model"}
 "pentaho/typeInfo": {
      "pentaho-visual-samples-svg/modelProductLine": {"base": "pentaho/visual/base/model"},
      "pentaho-visual-samples-svg/yourModel": {"base": "pentaho/visual/base/model"}


Let's create our .tgz file


Note: By using the npm pack command you ensure only your files and bundled dependencies are compressed in the resulting tgz file.


Run npm pack on viz-svg-sandbox folder.


Deploy on Pentaho server and PDI


Copy the pentaho-visual-samples-svg-0.0.1.tgz to

  • pentaho-solutions/system/karaf/deploy/ (Pentaho Server)
  • data-integration/system/karaf/deploy   (PDI)


See the visualization on Pentaho BA server and PDI


Create a new Analysis Report using SampleData

  • Choose
    • Measure - Actual
    • Category – Department


Pick on the list, right side, our SVG Circle visualization.




Do the same on PDI (DET)

  • Choose
    • Measure - Actual
    • Category – Department





Our .tgz has another extra visualization: SVG Product Line.






You can find more examples in Pentaho Marketplace, search by VIZ-SVG, and install both.




Why use Viz-SVG


Viz-SVG can be a good way to visualize your IoT data during all pipeline process, PDI, Analyzer and CTools dashboards.

Example: you may need to check the occupancy on your office floor, what best than an SVG office layout.



Enjoy and share your visualizations.

Joao Barbosa

SVG Component

Posted by Joao Barbosa Employee Dec 22, 2017


Born from the need to simplify data consumption on dashboards, this component's goal is to provide more flexibility by leveraging SVG rendering capabilities.


The Spark and the Idea

The design of a particular dashboard to show a vehicle current status, based on his sensor data, required the use of the next image.


The intention was to expose in the easiest way the information needed to support future actions.


Despite all possibilities to do this using the available CTools components and methods, the idea of using an SVG image on the dashboard, which could be changed according to the data being received, sounded really cool.


As you might know, SVG images are basically XML files with four interesting characteristics (at least the ones that matter for this component): they are scalable, they normally end up with file sizes shorter than similar image formats, they can have identifiable elements on their structure, and they are modifiable once injected into the HTML DOM, via javascript.


So, to show something similar to the designed data visualization for our dashboard we only needed to produce the SVG image (badly represented on the people's icon below), with ids on the elements that we wanted to change, gather the information that each sensor was sending, and modify the SVG layout (colors and/or text), using the required business rules.



How to Use

With this in mind and a couple more requirements, we ended up by creating the SVG component's first version that is available through Pentaho Marketplace.The component is quite simple, you can add it to your dashboard layout just like any other component, and its use can be controlled mostly be these 4 parameters:


  • URL - the relative path to the SVG image that will be used to show your data
  • Keep Image - the indication if an SVG reload should be performed (or not) when a dashboard refresh is required by the user
  • CallBackDraw - a courtesy function that will be executed after all the SVG changes are performed
  • Data source - The information that will be used to manage all the modifications necessary to show your data current status


Datasource Format and other requirements

In fact, everything is going to be managed by the information on your component's data source. Each record will represent a new SVG modification and is characterized by the following 4 attributes (the order is not significant but the names are):


  • svgElementID - the SVG element id where we want the component to perform changes
  • changeType - the kind of change the component should execute on the current element. This first version allows style (any style attribute common on CSS styling can be applied to SVG images) and text updates
  • styleProperty - the style property that will be added/changed. On text type operations this attribute is not mandatory
  • value - the value that will be used on the style property or the text update.



CDA example - data source record to perform the people icon's change illustrated earlier


Summing up, the process to use this component should be something like this:

  • Install it from Pentaho Marketplace (server restart will be needed). The component's use was only tested on Pentaho 7.0 versions or higher.
  • Review the component's samples that will be available after installation on plugin-samples / sample-svg-component
  • On a requireJS dashboard, add your layout placeholder, like for any other component, add the SVG component and start to fill the component's parameters
  • create and join your data source with the component specification
  • test it on browsers that support HTML 5.0 and SVG rendering (and pay attention that your audience should also have this type of browsers)


Where and when to use it

Whenever you sense that data readability is improved by using a specific image, and you are about to show it on a CTools dashboard, consider the SVG component. Remember, you can always show sensor data on a table or a particular chart, but the data coming from car sensors will always be more easy to read if we can show it on a related image.


Related work

To spread the use of these visualizations in the Pentaho product family, a merge between Viz-API and SVG capabilities is also available. This new artifact, called Viz-SVG, will allow you to use this type of visualizations on Pentaho Analyzer, Pentaho Data Integration DET and also CTools.

JDBC SQL logging

Posted by Kleyson Rios Employee Dec 19, 2017



A very typical use case for SQL logging is to troubleshoot various database related issues during development.


For this purpose, the P6Spy is a useful library to log all SQL statement and parameter values before send it to database. The P6Spy is free, it’s use to intercepts and logs all your database SQL statements into a log file, and it works for any application that uses JDBC driver.


If your application uses a DataSource, simply wrap your current DataSource object with P6DataSource. P6DataSource has a constructor method that accepts the DataSource to wrap. This is by far the simplest method especially if you use a dependency injection framework such as Spring or Guice.


If your application obtains connections from DriverManager, simply modify your JDBC connection URL to include p6spy: . For example, if your URL is jdbc:mysql://host/db then just change it to jdbc:p6spy:mysql://host/db . P6Spy implements the JDBC 4.0 API allowing automatic registration of our JDBC driver with DriverManager. As long as your application obtains connections through DriverManager, you only need to modify your database connection URL to activate P6Spy.


Using it


Pentaho can be configured to leverage on the P6Spy logging capability as well.


To demonstrate its use, let's configure the P6Spy to log all the Sql queries passing thru the SampleData datasource.


1. Download the P6Spy .zip file and unzip it.


2. Copy into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder the p6spy-x.x.x-log4j-nodep.jar and p6spy-x.x.x.jar files.


3. Edit the pentaho-server/tomcat/webapps/pentaho/WEB-INF/classes/log4j.xml file and append at the end of the file, before the tag </log4j:configuration>, the following configuration:


   <!-- ========================================================= -->
   <!-- Special Log File specifically for p6spy                   -->
   <!-- ========================================================= -->

   <appender name="SPYLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="../logs/spy.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>

   <category name="p6spy">
      <priority value="DEBUG"/>
      <appender-ref ref="SPYLOG"/>


NOTE: If the spy.log file is not created after the start of the Pentaho server, update the File param with the absolute path.


4. Create the pentaho-server/tomcat/lib/ file and insert the following configuration as its content:




5. Edit the pentaho-server/tomcat/webapps/pentaho/META-INF/context.xml and add the following configuration for the new P6Spy datasource:


<Resource name="jdbc/DSspy" auth="Container" type="javax.sql.DataSource"
   factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" maxActive="20" minIdle="0" maxIdle="5" 
   initialSize="0" maxWait="10000" 
   username="pentaho_admin" password="password"
   validationQuery="select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES"/>


6. Restart the Pentaho server.


7. Logs in with Admin in the PUC and click on the Manage Data Sources button.


Screen Shot 2017-12-18 at 10.23.26.png


8. Edit the datasource SampleData/JDBC.


Screen Shot 2017-12-18 at 10.24.24.png


9. Change the Access type from Native(JDBC) to JNDI, and set the JNDI Name to DSspy and Test the connection.


Screen Shot 2017-12-18 at 10.28.06.png


Screen Shot 2017-12-18 at 10.28.53.png


10. Test it. Open the CTools Dashboard under the Public/Steel Wheels folder and/or create a new Analysis Report to use the SteelWheelsSales cube.


Screen Shot 2017-12-19 at 08.59.46.png


Screen Shot 2017-12-19 at 09.03.15.png



11. Verify all the SQL queries being logged in the spy.log file.


Screen Shot 2017-12-19 at 09.06.30.png



  • current time - the current time is obtained through System.getCurrentTimeMillis() and represents the number of milliseconds that have passed since January 1, 1970 00:00:00.000 GMT. (Refer to the J2SE documentation for further details on System.getCurrentTimeMillis().) To change the format, use the dateformat property described in Common Property File Settings.
  • execution time - the time it takes in milliseconds for a particular method to execute. (This is not the total cost for the SQL statement.) For example, a statementSELECT * FROM MYTABLE WHERE THISCOL = ? might be executed as a prepared statement, in which the .execute() function will be measured. This is recorded as the statement category. Further, as you call .next() on the ResultSet, each .next() call is recorded in the result category.
  • category - You can manage your log by including and excluding categories, which is described in Common Property File Settings.
  • connection id - Indicates the connection on which the activity was logged. The connection id is a sequentially generated identifier.
  • statement SQL string - This is the SQL string passed to the statement object. If it is a prepared statement, it is the prepared statement that existed prior to the parameters being set. To see the complete statement, refer to effective SQL string.
  • effective SQL string - If you are not using a prepared statement, this contains no value. Otherwise, it fills in the values of the Prepared Statement so you can see the effective SQL statement that is passed to the database. Of course, the database still sees the prepared statement, but this string is a convenient way to see the actual values being sent to the database.




  • For a production environment, check out the impact on the performance.
  • SQL queries will not be logged if the data was already loaded in the cache.
  • No tests was done for scheduled tasks.


Multi-tenant Environment


For multi-tenant environment you might have the need to have the logged user on the log file as well. For that, you can also supply your own log message formatter to customize the format, including new data on it. Simply create a class which implements the com.p6spy.engine.spy.appender.MessageFormattingStrategy interface and place it on the classpath.


Customising the Log Message


1. Copy into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder the attached P6SpyCustomMessage-1.0.0.jar file.


2. Append the to pentaho-server/tomcat/lib/ file following configuration:






3. Restart the Pentaho Server.


4. Execute again the Reports/Dashboards and check out the new log message.


Screen Shot 2017-12-19 at 09.39.33.png



Enjoy !

This article was co-authored with Benjamin Webb


Foundational to any map—whether it be a globe, GPS or any online map— is the functionality to understand data on specific locations. The ability to plot geospatial data is powerful as it allows one to distinguish, aggregate and display information in a very familiar manner. Using Pentaho, one can use shape files to plot areas on a map within a dashboard and then explore data geospatially. In this example, we can use C*Tools and Pentaho Data Integration to examine geographic spread of crime occurring in the city of Chicago.


Getting and Preparing Shapefiles


There are many popular formats of shapefiles. The most popular is the format developed and regulated by ESRI. Many geographic analytics packages produce this format, so it is relatively easy to find shapefiles for common geographic boundaries, including Country, State/Province, County, Postal Code, Political boundaries and more. For this analysis, we’ll use data provided by the Chicago Data Portal.


First, to get the shapefiles, we will download the Community Area Boundaries datafile in ESRI format. To use in Pentaho, we will prepare the shapefile by converting ESRI to GeoJSON. We will use a command line tool provided by GDAL titled ogr2ogr. More information on this suite of tools can be found on their website and on GitHub. To execute this tool we can use the following PDI transformation that will call ogr2ogr.exe with parameters including GeoJSON for the destination filetype as well as the source and destination files.

From this process, important information is collected on the 76 community areas in Chicago. As seen below in a small sample of the GeoJSON file created, information is contained for the community of Douglas including a series of latitudes and longitudes representing the points that form a polygon.



"type": "FeatureCollection",

"name": "geo_export_80095076-0a6b-4028-a365-64ec9f0350d7",

"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [

{ "type": "Feature", "properties": { "perimeter": 0.0, "community": "DOUGLAS", "shape_len": 31027.0545098, "shape_area": 46004621.158100002, "area": 0.0, "comarea": 0.0, "area_numbe": "35", "area_num_1": "35", "comarea_id": 0.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -87.609140876178913, 41.84469250265397 ], [ -87.609148747578061, 41.844661598424025 ], [ -87.609161120412566, 41.844589611939533 ]…


Next, we can get crime data from the same site. This very large file contains diverse information on every crime occurring in Chicago since 2001 including the location, type of crime, date, community ID etc. To reduce the size of this file with over 6 million rows, we first run a quick PDI transformation that will group the crimes by year and community.


We need to join the crime dataset to the GeoJSON dataset we created as only that dataset contains the names of the community areas. Both datasets do share a common geographic code, allowing us to blend the data by community areas ID.


For this C*Tools dashboard, we will use another PDI transformation as the source of the data. As will be seen later, when the user selects a year on the dashboard, this transformation will be triggered to get the total number of crimes for every Chicago community area in a given year.  We will also get the max number of crimes for all neighborhoods, to be used later in the color gradient.


With this transformation, we now have a set of data that includes key metrics along with their corresponding geographic areas represented as polygons.


Create the Dashboard with NewMapComponent


Now to create the actual map in Pentaho, we will use the C*Tools Community Dashboard Editor (CDE). This will be done using the NewMapComponent under the Components View.


This powerful component utilizes a mapping engine that renders the world for free (choosing between OpenLayers or Google engines). Then, GeoJSON or KML is used to plot polygons above the map. In our case, the GeoJSON file outline the communities throughout Chicago will be mapped.


Much of the functionality will be used through JavaScript snippets in the Component Lifecycle steps, such as Pre-Execution and Post-Execution.


As a brief example here, we can load our GeoJSON file with the following JavaScript in our Pre-Execution stage:


// locate our GeoJSON file under resources. Note, the BA server does not
// recognize the .geojson extension, so we rename to .js
var getResource = this.dashboard.getWebAppPath() + '/plugin/pentaho-cdf-dd/api/resources';
var mapDef = '${solution:resources/geojson/community-areas-current-geojson.js}';

// here we pass in our GeoJSON file, and also specify the GeoJSON property
// to use as a polygon ID
this.shapeResolver = 'geoJSON';
this.setAddInOptions('ShapeResolver', 'geoJSON', {
     url: getResource + mapDef,
     idPropertyName: 'area_num_1'


This will plot the polygons, with 1 caveat. The NewMapComponent also takes a data source data source as input (see the properties tab). This data source must contain data that matches the IDs specified above in the GeoJSON file, and only those polygons for which data points with matching IDs exist will be rendered.


We can specify which columns from our data source to use as the ID in a snippet also in the Pre-Execution phase like so:

this.visualRoles = {
        id: 0,                            // communityarea
        fill: 1                           // crimes

Note, here we defined the column for id as the first column (index 0), and use the 2nd column (index 1) as the fill value (more below).


To load our pre-aggregated data and render it on our map, the Kettle transformation described above is used which takes a year parameter and then reads & filters the Aggregated-Crime-Counts.csv file.


This transformation ensures that the 1st column is the Community Area ID and the 2nd column is the # of crimes, to match our JavaScript above.


Finally, more JavaScript can be added to add additional dashboard features. For our heat map example, we want to vary the fill color based on # of crimes.


We've already linked the data with the code snippet above. The NewMapComponent has some defaults, but to ensure it works smoothly we can implement the fill function ourselves as follows, which is also implemented in the Pre Execution step:

// define the polygon's fill function manually based on the crimes/fill
// value incoming from the datasource
this.attributeMapping.fill = function(context, seriesRoot, mapping, row) {
        var value = row[mapping.fill];
        var maxValue = row[mapping.max];
        if (_.isNumber(value)) {
                 return this.mapColor(value,
                         0,                                        // min crimes
                         maxValue,                // max crimes from dataset in yr
                         this.getColorMap()     // a default color map of green->red


The above function validates the incoming fill/crimes column, and then tweaks the default color map (green to red) and maps all values on a scale of 0 to the max number of crimes in a year (in 2015, this number was 17,336; coming from the western community of Austin, seen below). All values between will be somewhere on the gradient.



Another very useful function that can be implemented within the NewMapComponent is the tool tip, that will highlight information about a community area, displaying the community name and the total number of crimes, when hovered by a mouse. This is implemented in the Post Execution, again utilizing JavaScript.


function tooltipOnHover() {
    var me = this;
        ** Define events for mouse move
        **/'mousemove',, function (e) {
                 if (!_.isEmpty(me.currentFeatureOver)) {
                         var modelItem = me.mapEngine.model.findWhere({
                                  .css('top', e.pageY -50)
                                  .css('left', e.pageX + 5)
                                  // html contained in popup
                    '<br>Total Crimes: ' + + ''
        });'movestart',, function (e) {
        });'moveend',, function (e) {




Effectively analyzing geospatial data, especially when used with enhancement tools like NewMapComponent can be a very powerful tool. From this basic example, we can better understand how crime exists across a very large city and how that spread has changed over time. Using the polygons allows us to better group the data in order to gain valuable insight.


This approach is heavily indebted to Kleyson Rios's NMC-samples repository, which has similar examples and can also be zipped & uploaded for exploring the NewMapComponent.


The code for this example can be found on GitHub.

This post originally published by Chris Deptula on Tuesday, October 27, 2015


I recently attended the Strata-HadoopWorld conference in NYC.  I have been attending this conference for the past few years and each year a theme emerges.  A few years ago it was SQL on Hadoop, last year was all Spark.  This year there was a lot of buzz about streaming and continuous data ingestion. Although often presented as different topics, the line between these concepts is blurring. 


To simplify, streaming means working with queues of events that are constantly flowing via a messaging system. Continuous ingestion is used more broadly and is the process of continuously ingesting from any data source, whether that be processing a file as soon as it is received, monitoring the database for new inserts, or streaming from a message queue.


One of the challenges with continuous ingestion is making data immediately available for analytics in a performant manner. For smaller amounts of data this can often be done using RDBMS’; however, for Big Data more complexity is required. Cloudera is trying to solve some of these challenges with Kudu; however, Kudu is still in beta and should not be used in production. Shortly before Strata-HadoopWorld, I started working on a demo for PentahoWorld using an architecture that solves this problem.


Before we go any further, let’s take a step back and explain the use case for our Pentaho World demo of real time analytics. I wanted to build a system that ingested a stream of tweets about the conference in real time while also making them available for analytics. The analytics are presented using a Pentaho CDE dashboard and Mondrian cube for slicing and dicing in Pentaho Analyzer. For the dashboard I performed a bit of Natural Language Processing on the tweets to determine their sentiment, and then made both aggregate metrics and individual tweets available as soon as the tweet was received. For the Pentaho Analyzer slice and dice capabilities, I allowed the data to be slightly older (not real time), but did update the underlying cube frequently.


How did I do this? At a high level I built a Pentaho Data Integration (PDI) transformation that acted as a daemon. This transformation used the Twitter Streaming API to get a continuous stream of tweets with the #PWorld15 hashtag, and does not stop processing new data until the transformation is manually stopped. I then used the Stanford Sentiment Analysis plugin for PDI to calculate the Wilson Sentiment of the tweet, before writing this data into MongoDB.


Once the data was in MongoDB, I built my Pentaho CDE dashboard. Pentaho CDE has the ability to source data from PDI transformations.  I built a few PDI transformations that used the MongoDB input step with Aggregate Pipelines to calculate the metrics I needed. Using Pentaho transformations as the input saved me from having to write a lot of JavaScript. Then it was simply a matter of telling the CDE components to refresh every 10 seconds and I had built my dashboard!


For the cube, Pentaho Analyzer requires a SQL interface to the data. It is true, that this could be solved by using PDI’s data service capability, which allows you to use SQL to query the results of a PDI transformation. However, the “PDI as a data service” option has a limitation in that the output data from the transformation must be able to fit in memory -- not practical for Big Data. Instead, I extended the transformation that I used to stream the data from Twitter into MongoDB to also write the data to Hadoop/HDFS. This enabled me to expose the data via Impala into the SQL interface Pentaho Analyzer requires.


Brilliant!  But, there was a catch. In Hadoop, a file is not readable until the file handle has been closed. If I never close the file because my transformation is constantly running, then the data would never be available to query. I could have written a file for every record, but this would have resulted in way too many small files and would have been detrimental to performance. Instead, I used the Transformation Executor step in PDI to group the tweets into 10 minute chunks, and then wrote one file every 10 minutes.


Once I had all of this figured out and working, I tested and encountered yet another problem. Tweets have new lines and lots of special characters. Storing the data in a plain text file was going to be a challenge. The good news is there are better file formats for Hadoop than plain text files, such as Avro and Parquet. So, instead of using text files, I used the Pentaho Consulting Labs Parquet Output step. This had the dual benefit of easily handling new lines and special characters while also improving the underlying performance of Cloudera Impala queries.


Finally, I created a few Impala tables and a Mondrian cube, and solved my problem. I was able to stream tweets in real time through Pentaho to make them immediately available in a dashboard, and, with a modest delay, also have them available for slice and dice analytics.




This post originally published by Kevin Haas on Tuesday, July 14, 2015


When working with our clients, we find a growing number who regard their customer or transaction data as not just an internally leveraged asset, but one that can enrich their relationships with customers and supporting partners. They need to systematically share data and analytics outside their firewall.


One of the ways we help them achieve this goal is with Custom Data API's. In this article, we'll show you how to build Data API's using visual programming within the Pentaho Data Integration Interface. Along the way, we'll expose capabilities of the platform that are not always seen by the average user.


Inspecting The Field: Inside Pentaho Data Integration

For those who use Pentaho's Data Integration (PDI aka Kettle) platform regularly, you're probably aware of the amazing things you can do. For the uninitiated, PDI is a very powerful Extract-Transform-Load (ETL) technology that lets you access data from virtually any source, perform complex transformations, and push/load that data to virtually any destination. It's commonly used for integrating and shaping data into new forms optimized for analytical purposes (data warehouses, data marts, etc.).


However, Pentaho experts know that Kettle is more than an ETL platform. It is powered by an amazing, open, and flexible data management platform that can not only read and write data from relational databases, Hadoop, NoSQL databases, and Web API's, but it can also serve data to other applications.


Enter one of the most powerful (but maybe underutilized) components of the Pentaho Data Integration Platform: the Data Integration Server. It can power your API's.


Sowing The Seeds: The Data Integration Server

If you're an Enterprise Edition Pentaho customer, you're probably aware of the capabilities of the Data Integration Server. It's one of the foundational reasons why customers choose the Enterprise Edition of PDI. The Data Integration Server is a server application that allows you to schedule and run jobs and transformations created with PDIs "Spoon" visual development tool.


Community Edition users have a similar (but not as sophisticated) version of the Data Integration Server with the included Carte server. If you're a community edition user, it's easy enough to start up your own local Carte server on a port of your choosing. You could start Carte on port 8081 like this: 8081


carte.bat 8081


There are many configuration options for the Data Integration Server and Carte server that allow it to operate in all kinds of forms. See here for the current list of options.


Bringing It To Life: Calling The Transformation on the Server with ExecuteTrans

The first step to create a Data API is to develop a Pentaho Data Integration transformation that gathers data from virtually any source: RDBMS, Hadoop, NoSQL Databases, Web APIs, etc. Since transformations can be parameterized, you can apply filtering, sorting or any other customizable logic to your code. Again, this can all be developed using PDIs graphical development tool: Spoon.


Once you have installed and launched your Data Integration or Carte server, you can execute your transformations with a simple HTTP call using the ExecuteTrans method. For example, if you have Carte running on port 8081 and want to execute a transformation that is stored in /srv/pentaho/sample.ktr and accepts a parameter named "parm", you can simply call:


When you launch this URL within a browser, you wont see a lot of information in return. In fact, you'll get nothing back! But, if you look at the PDI logs on the server, you'll see that the sample.ktr transformation did run. This is all well and good -- you were able to launch a transformation on a server via an HTTP request. But, now you want it to return some data!


Growing a Transformation into an API: The Output To Servlet Option

To provide an output, the transformation must produce one or more rows of data through a configured output step. To send rows of data out as a response from the ExecuteTrans method, your transformation must pass this data to a Text File Output (or JSON Output) Step, configuring the step to "Pass output to servlet."



By checking the "Pass output to servlet" box, the transform will take the stream of data entering this step and push it out as your response. Note that the exact data contents of the stream are output as specified by the steps formatting options (e.g. fields, headers, separators). You'll want to configure depending on your needs.


Bringing it to Market: A Taste of Farmers Market Data In A Custom Data API

Lets look at an example. With harvest season upon us, more and more farmers markets are filling up with their bounty. But where can you find local farmers markets? Thankfully, the USDA provides an API that lets you search for nearby markets. 


This is a powerful set of data, but its not always easy to use the API (or any API, as provided). In this case, to get a market, you first provide a zip code. The API returns a list of markets with their associated IDs, which you then must use to query the API again in order to get detailed information on each market. Also, we'd like to handle the result data using JSON format and some elements of the API return comma separated values. Finally, we need to format the returned address a bit better, attaching latitude and longitude for mapping.


To begin, we'll create a transform that accepts a parameter called "zip", and does the following:


1. Generate Rows: Initialize with a row for Pentaho to process input parameters.

2. HTTP Client: Call the USDA API to get the list of markets with the input parameters.

3. JSON Output: Convert the JSON that comes back into a list of markets.

4. JSON Output: Extract the Market IDs from that list of markets.

5. Calculator: Format a URL string to re-query the API to get detailed market info for each market.

6. HTTP Client: Call the USDA API again for each of the markets to get the details.

7. JSON Output: Extract the detailed address info on the market.

8. HTTP Client: Call the Google Geocoding API to get the latitude and longitude.

9. JSON Output: Extract the latitude and longitude from Google's response.

10. Javascript: Clean up and format the Address and Market info from the source APIs into a nice JSON format.

11. Group By: Group all of the detailed Markets from the API into a single row.

12. Javascript: Format the final JSON.

13. Text File Output: Output the single data element as a response to the servlet to show in the server.



A lot of steps, for sure, but their configuration is relatively simple. This is what the transformation looks like in the Spoon IDE:


Once we have developed and tested our transformation using Spoon, we are ready to use our Carte server and the ExecuteTrans method. We open a browser and enter the following URL: 


Note, we can even front end the Carte or DI server with Apache and do a little URL rewriting to make it a friendlier URL to call. We did this. So instead of the above executeTrans URL, we can call this:


When you execute the API, it does all the work for you. It calls all the various child APIs, does all the cleanup and formatting, and returns delicious looking JSON, with all of the added items from the Transformation as a result. For example:


          marketname:"Eli's Cheesecake and Wright College Farmer's Market",
          location:"6701 W. Forest Preserve Drive, Chicago, Illinois, 60634",
          address1:"6701 West Forest Preserve Drive",
             "06/12/2015 to 09/25/2015 Thu: 7:00 AM-1:00 PM"
             "Baked goods",
             "Crafts and/or woodworking items",
             "Cut flowers",
             "Fresh fruit and vegetables",
             "Canned or preserved fruits, vegetables, jams, jellies, preserves, dried fruit, etc.",
             "Maple syrup and/or maple products",
             "Soap and/or body care products"


Taking It Home: Building Your Own Data APIs

This is a fun illustration of how to pull data from various sources and craft it together into your very own API. All of this was done with with out-of-the-box Pentaho Data Integration capabilities, no extra parts required.


Clearly, you're not limited on what you can do with this capability. You can easily get data from virtually any source and wrap a custom data API around it, giving you new capabilities to leverage your existing infrastructure, and share data in new ways. Pentaho has a lot of capabilities inside. We know that once you take PDI home from the market and look inside the box, you'll be inspired.


If you're interested in seeing how the transformation works, including a look at the transformation that powers the example, just let us know.

This post originally published by Bryan Senseman on Wednesday, October 15, 2014


I'm a huge user of Mondrian, the high speed open source OLAP engine behind Pentaho Analyzer, Saiku, and more. While the core Mondrian engine is wonderful, there are times when it doesn't do what I need it to, or exactly what I expect it to. Take this special case that I've encountered at some of my OEM customers, a case I call "Abnormal Genealogy."


Mondrian Needs Certain Data To Show Hierarchies

Mondrian has multiple ways to define hierarchies; traditional, ragged and parent-child. This is the feature that allows you to roll up and roll down metrics for complex parent and child record combinations. Think organization hierarchies, product hierarchies and employee hierarchies. In many cases, you have the IDs and data to support this cleanly, the way Mondrian expects, where parents have IDs that are less than (e.g. 3 < 14) than their associated children IDs.


But having data the way Mondrian expects it isn't always the case, particularly with OEM customers. There can be cases where the IDs of the child are less than than IDs of the parent. For example, my latest client preloads their system with standard/common "items" and then allows the end customer to create new parents (groupings) of these items. As a result, the preloaded items end up with IDs that are less than the newly added, customer-specific parent items. This reasonable load pattern breaks the Mondrian assumption that parent IDs are always less than child IDs.


Parent-Child ID's Need to Follow Certain Rules

When I first saw this, I thought "This is an edge case." But then I saw it again. And again. Clearly, I had to come up with a repeatable way to solve this problem. Fortunately, the ubiquitous FoodMart database can be used to replicate this issue. It contains an employee hierarchy with parent-child relationships that are pre-populated to fit the Mondrian assumption: all parent IDs are less than child IDs.  


Out of the box, FoodMart works correctly, but what happens if we simulate the replacement of a supervisor? We need to insert a new supervisor record in the employee table. This results in the new supervisor record receiving a higher employee_id than the employees which will report to it. The simple assignment of IDs for new records causes this peculiarly common case: abnormal genealogy!


To demonstrate this FoodMart scenario, let's look at a Pentaho Analyzer report showing Senior Management and some reporting data before the supervisor replacement:



Sheri Nowmer, inclusive of herself, has 10 Senior Managers assigned to her,  which includes Roberta Damstra.   All is well and good since the IDs of the parent records (supervisors) are less than the IDs of the child records (managed employees). Now, let's say Roberta retires to sunny Orlando and a new VP is hired, Mr. New Guy, on 1/1/1998. A few simple SQL edits and a Mondrian cache clearing and we should be good to go. We'll assign Mr. New Guy's employee_id with the value 1206. 1206 is much larger than any other employee_id in our database so it will easily stick out when we are reviewing the data. It also follows the usual pattern of a new employee record being assigned an ID which is greater than the maximum ID used.


Running the same report again after clearing the Analyzer and Mondrian caches, provides us with this view:



As expected, Roberta is no longer listed. However, Mr. New Guy also does not appear in this view. We confirm that we inserted his record so that he reports to Sheri Nowmer and that all of Roberta's employees now report to him. So, what happened to Mr. New Guy? His "abnormal genealogy" is causing internal issues for Mondrian. That is, the fact that his ID is greater than the IDs of his reports is causing Mondrian to incorrectly traverse the reporting hierarchy.


Solving the Abnormal Geneaology Problem

So, how can we solve this problem? We could dig into the Mondrian source code. It's open source afterall. But, I'd prefer to leave revising the core of the Mondrian engine to the java wizards at Pentaho and in the community. Perhaps, we can pull off a bit of Pentaho Data Integration (PDI) magic with some minor data structure changes to overcome this gap faster and easier.


The solution here is to use PDI to generate an extra set of ordered parent-child IDs. Using PDI, we need to identify the correct hierarchical order and make this order available to Mondrian. The Transform shown below demonstrates the steps I used to re-sequence the IDs to the correct genealogical order. Using this approach, parents consistently have smaller IDs than all of their children “making Mondrian happy without the need to modify its core code." Since the solution is data-centric we're also better protected for future versions of Mondrian.



So where exactly is the "magic" needed to resequence the IDs? It's in the use of the Closure Generator step, utilizing the existing parent and child IDs (which works regardless of genealogy). In this case, we take the base data and then reorder the ultimate parent rows by distance, as provided from the Closure Generator. Notice (below) that the Mr. New Guy, employee_id number 1206, is now the ninth "oldest" employee, perfect for a parent!



The next part of the solution is to update the dimension table with both ordered IDs as they will be needed in the Mondrian hierarchy configuration. If you're maintaining this with a PDI Dimension Lookup/Update step, I suggest using the Punch Thru option for these two fields. 


The final bit of data engineering requires you to create a tweaked version of the closure table that uses the ordered parent and original child IDs. This allows the correct groupings for parent aggregation while linking back into the original fact table with the actual IDs.



Now that our data engineering tasks are complete, all we need is a small adjustment of  the hierarchy definition in the Mondrian schema. We simply modify the level definition to utilize the new ordered fields. These changes result in the following dimension definition. (Note we create new tables called employee2 and employee_closure_cube in order to maintain the original foodmart structures in case we want to revert to the out-of-the-box configuration later.)



Back to Analyzer we go, refreshing both caches and...sure enough; Mr. New Guy is official!



When I first encountered this problem, even Google wasn't able to find an effective solution. I only found a few discussions of the problem in semi-related jiras: and Given my experience with hierarchical data, I expect that this is a fairly common case. I hope you find this approach helpful.

This post originally published by Chris Deptula on Wednesday, November 19, 2014.


Many of you requested more information on the inner workings of the Sqoop component. Perhaps the best way to explain is via "lessons learned". Here goes...


Use the split-by Option

Sqoop is primarily used to extract and import data from databases into HDFS and vice versa. This was exactly what HDI needed to do. So, I pulled up the Sqoop user manual for a quick refresher. I quickly found the import-all-tables Sqoop option. It seemed like the perfect solution. One command and I can extract every table from the datamart to HDFS. Great! So I tried it...


Almost immediately, Sqoop failed. It turns out that Sqoop, by default, extracts data from tables in chunks, and the chunking is done using each table's primary key. If a table has a single column primary key the extract works. If no primary key exists, it fails. In our experience, most dimension tables follow this design pattern -- using a single-column surrogate key. However, many, if not most, fact tables do not have a primary key, much less one stored in a single column. This was the case in our test database. When Sqoop tried to extract a fact table, the job failed. Time to start over.


It turns out Sqoop has a split-by option specifically designed to override the primary key default. Unfortunately, the split-by option cannot be used with import-all-tables since the split-by column will vary by table. This meant that HDI would have to generate table-specific import statements, each with their own split-by column. Simple enough. I could use Pentaho Data Integration to interrogate the database metadata and generate each table's specific Sqoop import command. Alas, once I began designing, I soon discovered complexity.


When using the split-by option, you should choose a column which contains values that are uniformly distributed. If there is a skew in the column data, then the chunks being processed will also be skewed causing overweighted import job tasks to potentially fail due to lack of compute resource (usually memory). Thus, picking the spit-by column required knowledge of table content that an automated process may not have. We'd have to make educated guesses.


Like the import-all-tables option, the easiest rule to apply is to use a single column primary key if it exists. We'd have to assume that the column is a surrogate key that was populated with a sequence of values. If no primary key column exists, we look for a date column. Our educated guess is that dates represent events and events tend to be more uniformly distributed than not. If no date column exists, we look for an integer column. If no integer, then we simply grab the first column we can find. Clearly this system is "version 1", with much potential for optimization. We have some ideas and plan on iterating the design in the coming months.


With the acknowledged fuzziness of our split-by algorithm, we knew that we'd have to enable HDI configuration so that it only generated code. You can use HDI to interrogate your source database metadata, generate the Sqoop import commands and, then, optionally execute those commands. This gave us more control to manually customize and execute the generated import commands.


Next, with my generated Sqoop import commands, I tried to run them.


Manage Database Fetch Size

The imports were working. I had loaded several dimensions and the first fact table. However, while importing my second, larger fact table, the MapReduce job failed with a GC Overhead limit exceeded error message. After a lot of Google searching and head scratching I determined that the source MySQL database was reading all the rows in one fetch and trying to store everything into memory. I had to tell MySQL to return the data in batches. The following parameters on my jdbc connection string did the trick.





My issue and solution were specific to MySQL, but I suspect that the same problem could occur for other RDBMS and jdbc drivers. You've been warned.


Convert Dates to Strings

With the fetch problem solved, my Sqoop imports finally completed. My next task was to define Hive tables on the newly loaded data and attempt some queries. All was fine until I selected some date fields. They were returned integers instead of dates...ugh! Being an experienced programmer, I quickly realized that the integers were "seconds since epoch." -- which most will agree is not a user friendly way to present dates! I dug through the Sqoop code and recognized that my date values were being treated as timestamps. To prevent Sqoop from converting dates to seconds since epoch, I had to manually map the datatype for each date field. I used Sqoop's map-column-java option to map date columns to the java String class. The generated Sqoop command had to include an option like this:


--map-column-java my_date=String


Once I made the change, the dates were output and queryable in yyyy-MM-dd HH:mm:ss.S format instead of seconds since epoch.



It took quite a bit of trial and error, but I was finally able to develop a Hadoop Datamart Importer. The core Sqoop import command generation module works like a champ! If you would like to learn more about how Pentaho Consulting's Data Engineering services can help you with your Hadoop data ingestion needs, give us a call.

This post originally published by Chris Deptula on Tuesday, February 24, 2015.


This is the third in a three part blog on working with small files in Hadoop.


In my previous blogs, we defined what constitutes a small file and why Hadoop prefers fewer, larger files. We then elaborated on the specific issues that small files cause, specifically inordinate NameNode memory usage and MapReduce performance degradation. We then began exploring common solutions to these problems. In this blog, I’ll close this series by examining some less commonly used alternatives for solving the MapReduce performance problem and the factors to consider when choosing a solution.


Solving the MapReduce Performance Problem

In my previous blog, I listed the following solutions for mitigating the MapReduce performance problem:

• Change the ingestion process/interval

• Batch file consolidation

• Sequence files

• HBase

• S3DistCp (If using Amazon EMR)

• Using a CombineFileInputFormat

• Hive configuration settings

• Using Hadoop’s append capabilities


My second blog already discussed changing the ingestion process, batch file consolidation, and sequence files. I will cover the remaining options here.



If you are producing numerous small files, storing your data as files in HDFS may not be the best solution. Instead you might consider using an HBase column store. Using HBase changes your ingestion process from producing many small HDFS files to writing individual records into HBase tables. If your data access pattern is based on well-defined, random-access lookups, HBase may be your best alternative. It is architecturally tuned for high-velocity data record inserts, high-volume, individual record lookups and streaming based analytics. However, if your data access pattern tends toward full file/table scans, then HBase may not be optimal.


It is possible to create a Hive table that maps to HBase data; however, query performance in this design will vary. Hive on HBase will shine when a single row or a range of rows is selected, but if your queries tend toward full table scans HBase is very inefficient. Most analytical queries, especially those using group bys, require full table scans.


HBase provides the best ability to stream data into Hadoop and make it available for processing in real time. However, balancing the needs of HBase with other cluster processes can be challenging and requires advanced system administration. Additionally, HBase performance is largely dependent on data access patterns and these should be carefully considered before choosing HBase to solve the small file problem.



This solution is only available for users of Amazon EMR. Amazon EMR clusters are designed to be short lived, persisting their data in Amazon S3. Even with Amazon S3, processing a large number of small files still results in launching more map tasks than necessary -- decreasing performance. Enter S3DistCp...


S3DistCp is a utility provided by Amazon for distributed copying of data from S3 to ephemeral HDFS or even other S3 buckets. The utility provides the capability to concatenate files together through the use of groupBy and targetSize options. This is useful when you have thousands of small files stored in S3 that you want to process using Amazon EMR. S3DistCp kills two birds with one stone by concatenating many small files and making them appear in faster, ephemeral HDFS storage. There have been reports of as much as 15x performance improvement using this mechanism.


For all practical purposes S3DistCp does the same task as the batch file consolidation approach I mentioned in my previous blog. If using Amazon EMR, note that you have a pre-built tool that accomplishes this task.


Using a CombineFileInputFormat

The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them “on the fly” for consumption by a single map task. You gain the benefits of not launching one map task per file and not requiring that multiple files be merged into a single persisted file as part of a preparatory step. This solves the problem of a MapReduce job launching too many map tasks; however, since the job is still reading in multiple small files random disk IO is still a problem. Additionally, most implementations of the CombineFileInputFormat do not account for data locality and will often pull data from a variety of data nodes over the network.


In order to implement this, the CombineFileInputFormat must be extended in Java for different file types. This requires significant development expertise to develop your custom input format classes. However, once written, you can configure a maximum split size and it will merge files until this size is met.


Note that since the merged data is not persisted in HDFS, the CombineFileInputFormat does not alleviate the NameNode memory problem. A good example of an implementation of a CombineFileInputFormat may be found here.


Hive Configuration Settings

If you notice that Hive creates small files in your Hadoop cluster through “create table as” or “insert overwrite” statements, you can adjust a few Hive specific configuration settings to mitigate. When used, these settings tell Hive to merge any small files that were created into larger files. However, there is a penalty. Hive will launch an additional MapReduce job, post-query, to perform the merge. Further, the merge is done before Hive indicates to the user that the query has finished processing instead of occurring asynchronously.


It should be noted that these settings only work for files that are created by Hive. If, for example, you create files outside of Hive using another tool such as Sqoop to copy into the Hive table using a hdfs fs –mv command, Hive will not merge the files. Therefore, this solution does not work when the files ingested into Hadoop are small. This solution is only recommended in Hive-centric architectures where a small performance penalty in insert overwrite and create table as statements is acceptable.


The settings to be used are:



Using Hadoop’s Append Capabilities

I would be remiss if I did not mention this final option. Invariably, when I discuss how to handle small files, I hear the question “I was reading a blog and saw that Hadoop has the ability to append to files. Why can’t we just append to existing files?”


The story of appends in Hadoop is rather rocky. Append was added in July of 2008 as part of Hadoop 0.19. However, after implementation (as early as October 2008) many issues were found and append was disabled in 0.19.1. However, in order to support HBase without risk of data loss append capabilities were added back to Hadoop in 0.20.2. So, finally, after 0.20.2 it was technically possible to perform appends in Hadoop.


Append may be available, but none of the major tools in the Hadoop ecosystem support it: Flume, Sqoop, Pig, Hive, Spark, and Java MapReduce. MapReduce enforces a rule that the output location of a MapReduce job must not exist prior to execution. Due to this rule it is obviously not possible for MapReduce to append to pre-existing files with its output. Since Sqoop, Pig, and Hive all use MapReduce under the covers it is also not possible for these tools to support append. Flume does not support append largely because it operates under the assumption that after a certain period either in terms of seconds, bytes, number of events, or seconds of inactivity, Flume will close the file and never open it again. The Flume community has deemed this sufficient and not demanded append support.


If you truly must use appends in Hadoop, you have to write your own system to perform the ingestion and append to existing files. Additionally, if any of your in-cluster processing requires appending to existing files, you will not be able to use Spark or MapReduce. Therefore using HDFS append capabilities is very complex and should only be used by the most technologically savvy organizations. Without a significant engineering team and support commitment, this option is not recommended.


Choosing a Solution

Choosing the best solution for working with small files depends on a variety of questions. It may be necessary to use a combination of these solutions based on access patterns and data requirements. The questions that should be considered include:

• At what point in the data flow are the small files being generated? Are the small files being created at ingestion, or via in-cluster processing?

• What tool is generating the small files? Can changing tool configuration reduce the number of small files?

• What level of technical skill exists within your organization? Do you have the capabilities to maintain input formats or writing your own ingestion engine?

• How often are the small files being generated? How often can small files be merged in order to create large files?

• What sort of data access is required to these small files? Do the files need to accessible through Hive?

• What type of administrative periods might exist where processes can be run inside the cluster to alleviate small files?

• What level of latency is acceptable from MapReduce processes?


Once you answer these questions, then you can evaluate and choose the best options.

This post originally published by Chris Deptula on Wednesday, February 18, 2015


This is the second in a three part blog on working with small files in Hadoop. In my first blog, I discussed what constitutes a small file and why Hadoop has problems with small files. I defined a small file as any file smaller than 75% of the Hadoop block size, and explained that Hadoop prefers fewer larger files due to NameNode memory usage and MapReduce performance. In this blog, I will discuss solutions to these challenges when small files are truly unavoidable.


Solving the NameNode Memory Problem

As discussed in my previous blog the metadata for every block in Hadoop must be stored in the NameNode's memory. This leads to a practical limit on how many objects can be stored in Hadoop and also has implications on startup time and network bandwidth. There are two solutions to this, decrease the number of objects in your Hadoop cluster or somehow enable greater memory use by the NameNode -- but without causing excessive startup times. The most common approaches to solve this memory problem involve Hadoop Archive (HAR) Files and Federated NameNodes.


Hadoop Archive Files

Hadoop archive files alleviate the NameNode memory problem by packing many small files into a larger HAR file, similar to TAR files on Linux. This causes the NameNode to retain knowledge of a single HAR file instead of dozens or hundreds of small files. The files within a HAR file can be accessed using the har:// prefix instead of hdfs://. HAR files are created from files that exist in HDFS. Therefore, a HAR file can consolidate both ingested data as well as data created through normal MapReduce processing. HAR files can be used independent of the technology used to create the small files. There is no common dependency other than HDFS.


Although HAR files reduce the NameNode memory footprint for many small files, accessing and processing HAR file content will likely be less efficient. HAR files are still stored randomly on disk and reading a file within a HAR requires two index accesses -- one for the NameNode to find the HAR file itself and one to find the location of the small file within the HAR. Reading a file in a HAR may actually be slower than reading the same file stored natively on HDFS. MapReduce jobs compound this performance issue as they will still launch one map task per file within the HAR.


In the end, you have a trade-off, HAR files can solve the NameNode memory issue, but may worsen processing performance. If your small files are primarily kept for archival purposes, with infrequent access, then HAR Files are a good solution. If the small files are part of your normal processing flow, you may need to rethink your design.


Federated NameNodes

Federated NameNodes allow you to have multiple NameNodes in your cluster each storing a subset of the object metadata. This eliminates the need to store all object metadata on a single machine, providing more scale for memory usage. On the surface, solving the small file memory problem with this technique is attractive, but with a little more thought you'll quickly realize the limitations.


Federated NameNodes isolate object metadata -- only one NameNode knows about any particular object. This means to get a file you must know which NameNode to use. If your cluster houses multiple tenants and/or siloed applications, then federated NameNodes are a natural fit -- you can isolate object metadata by tenant or application. However, if you are sharing data across all applications within a cluster, this approach is not ideal.


Since federation does not actually change the number of objects or blocks within your cluster, it does not solve the MapReduce performance problem. Conversely, federation adds significant and often unnecessary complexity to your Hadoop installation and administration. Federation, when used to solve the small file problem, is often more of a mechanism to hide the small file problem.


Solving the MapReduce Performance Problem

As discussed in my previous blog, the MapReduce Performance problem is caused by combination of random disk IO and launching/managing too many map tasks. The solution seems obvious -- have fewer, larger files or launch fewer map tasks; however, this is often easier said than done. Some of the most common solutions include:

  • Change the ingestion process/interval
  • Batch file consolidation
  • Sequence files
  • HBase
  • S3DistCp (If using Amazon EMR)
  • Using a CombineFileInputFormat
  • Hive configuration settings
  • Using Hadoop's append capabilities

We will discuss the first three of these options in this blog, and cover the remainder in the third and final blog of the series.


Change the Ingestion Process/Interval

The easiest way to get rid of small files is simply not to generate them in the first place. If your source system generates thousands of small files that are copied into Hadoop, investigate changing your source system to generate a few large files instead, or possibly concatenating files when ingesting into HDFS. If you are only ingesting 10 MB of data every hour, determine if it's possible to only ingest once a day. You'll create 1x240MB file instead of 24x10MB files. However, you may not have control over the source system creating the files or business needs require that you ingest data at interval frequencies such that small files are unavoidable. If small files are truly unavoidable then other solutions should be considered.


Batch File Consolidation

When small files are unavoidable, file consolidation is most common solution. With this option you periodically run a simple, consolidating MapReduce job to read all of the small files in a folder and rewrite them into fewer larger files. If you have 1000 files in a folder, and specify only 5 reduces for the MapReduce job, the 1000 input files will be merged into 5 output files. Followed by some simple HDFS file/folder manipulation, you have reduced your memory footprint by 200:1 and, likely, improved the performance of future MapReduce processing on the same data.


This can be implemented in as little as 2 lines of Pig, a load and a store statement. For example, if consolidating text files:



Implementing this in Hive or Java MapReduce is equally as easy. These MapReduce jobs obviously require cluster resources while executing, and are often scheduled during off hours. However, they should be run frequently enough so the performance impact of small files does not become too extreme. Additional logic is often built into these jobs to only merge files in folders that will have a noticeable performance impact. Merging files in a folder that only contains three files will not result in as great of a performance benefit as merging files in a folder that contains 500 small files.


Checking folders to determine which folders should be consolidated can be accomplished in a variety of ways. For example a Pentaho Data Integration job can be used to iterate through a group of folders in HDFS, finding those that meet a minimum set of requirements for consolidation. There is also a pre-written application designed specifically for this task called File Crush, an open source project written by Edward Capriolo. File Crush is not professionally supported, therefore no guarantees exist that it will continue to work with future versions of Hadoop.


Batch file consolidation does not maintain the original file names. If having the original file name is important for processing or understanding where the data originated, batch file consolidation will not work. However, most HDFS designs embed naming semantics at the folder level and not within each file. Adopting this practice removes file name dependencies as an issue.


Sequence Files

When there is a requirement to maintain the original filename, a very common approach is to use Sequence files. In this solution, the filename is stored as the key in the sequence file and the file contents are stored as the value. The table below gives an example of how the small files would be stored in a sequence file:



If you have 10,000 small files your sequence file would contain 10,000 keys, one per file. Sequence files support block compression, and are splittable meaning that MapReduce jobs would only launch one map task per 128MB block instead of one map task per small file. This works very well when you need to maintain the input file name, and you are ingesting hundreds or thousands of small files at the same time.


However, if you are only ingesting a small number of small files at a time the sequence file does not work as well because Hadoop files are immutable and cannot be appended to. Three 10MB files will result in a 30MB sequence file which is still, by our definition, a small file. Another challenge is that the retrieval of a list of file names within a sequence file requires processing the entire file.


Additionally, Hive does not work well with sequence files in this structure. Hive treats all of the data within a value as a single row. It would not be easy to use Hive to query this data as the entire contents of a file would be a single row within Hive. Finally, the Hive tables you create will not have access to the sequence file key, the filename, and will only have access to the value, the contents of the files. It may be possible to write a custom Hive serde to solve these challenges, but that is an advanced topic beyond the native capabilities within Hadoop.


More details about challenges and limitations of Sequence Files may be found in our Hadoop file formats blog post.



In this blog, we discussed the trade-offs of using Hadoop Archive (HAR) files to minimize NameNode memory usage. We discussed and dismissed the use of Federated NameNodes as a panacea for the small file problem. And, we introduced some common solutions for small file consolidation --- solutions which improve both NameNode memory usage and MapReduce performance. In my third and final blog, I'll discuss additional, less commonly used, but valuable techniques for file consolidation.



This post published by Chris Deptula on Wednesday, February 11, 2015


This is the first in a 3 part blog on working with small files in Hadoop. Hadoop does not work well with lots of small files and instead wants fewer large files. This is probably a statement you have heard before. But, why does Hadoop have a problem with large numbers of small files? And, what exactly does "small" mean? In the first part of this series I will answer these questions. The subsequent parts will discuss options for solving or working around the small file problem.


What is a small file?

A small file can be defined as any file that is significantly smaller than the Hadoop block size. The Hadoop block size is usually set to 64,128, or 256 MB, trending toward increasingly larger block sizes. Throughout the rest of this blog when providing examples we will use a 128MB block size. I use the rule if a file is not at least 75% of the block size, it is a small file. However, the small file problem does not just affect small files. If a large number of files in your Hadoop cluster are marginally larger than an increment of your block size you will encounter the same challenges as small files. For example if your block size is 128MB but all of the files you load into Hadoop are 136MB you will have a significant number of small 8MB blocks. The good news is that solving the small block problem is as simple as choosing an appropriate (larger) block size. Solving the small file problem is significantly more complex. Notice I never mentioned number of rows. Although number of rows can impact MapReduce performance, it is much less important than file size when determining how to write files to HDFS.


Why do small files occur?

The small file problem is an issue a Pentaho Consulting frequently sees on Hadoop projects. There are a variety of reasons why companies may have small files in Hadoop, including:


  • Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
  • The source system generates thousands of small files which are copied directly into Hadoop without modification.
  • The configuration of MapReduce jobs using more than the necessary number of reducers, each outputting its own file. Along the same lines, if there is a skew in the data that causes the majority of the data to go to one reducer, then the remaining reducers will process very little data and produce small output files.


Why does Hadoop have a small file problem?

There are two primary reasons Hadoop has a small file problem: NameNode memory management and MapReduce performance. The namenode memory problem Every directory, file, and block in Hadoop is represented as an object in memory on the NameNode. As a rule of thumb, each object requires 150 bytes of memory. If you have 20 million files each requiring 1 block, your NameNode needs 6GB of memory. This is obviously quite doable, but as you scale up you eventually reach a practical limit on how many files (blocks) your NameNode can handle. A billion files will require 300GB of memory and that is assuming every file is in the same folder! Let's consider the impact of a 300GB NameNode memory requirement...

  • When a NameNode restarts, it must read the metadata of every file from a cache on local disk. This means reading 300GB of data from disk -- likely causing quite a delay in startup time.
  • In normal operation, the NameNode must constantly track and check where every block of data is stored in the cluster. This is done by listening for data nodes to report on all of their blocks of data. The more blocks a data node must report, the more network bandwidth it will consume. Even with high-speed interconnects between the nodes, simple block reporting at this scale could become disruptive.


The optimization is clear. If you can reduce the number of small files on your cluster, you can reduce the NameNode memory footprint, startup time and network impact.


The MapReduce performance problem

Having a large number of small files will degrade the performance of MapReduce processing whether it be Hive, Pig, Cascading, Pentaho MapReduce, or Java MapReduce. The first reason is that a large number of small files means a large amount of random disk IO. Disk IO is often one of the biggest limiting factors in MapReduce performance. One large sequential read will always outperform reading the same amount of data via several random reads. If you can store your data in fewer, larger blocks, the performance impact of disk IO is mitigated.


The second reason for performance degradation is a bit more complicated, requiring an understanding of how MapReduce processes files and schedules resources. I will use MapReduce version 1 terminology in this explanation as it is easier to explain than with Yarn, but the same concepts apply for Yarn. When a MapReduce job launches, it schedules one map task per block of data being processed. Each file stored in Hadoop is at least one block. If you have 10,000 files each containing 10 MB of data, a MapReduce job will schedule 10,000 map tasks. Usually Hadoop is configured so that each map task runs in its own JVM. Continuing our example, you will have the overhead of spinning up and tearing down 10,000 JVMs!


Your Hadoop cluster only has so many resources. In MapReduce v1, to avoid overloading your nodes, you specify the maximum number of concurrent mappers a node can process. Often the maximum number of concurrent mappers is in the 5 to 20 range. Therefore, to run 10,000 mappers concurrently you would have to have 500 to 2000 nodes. Most Hadoop clusters are much smaller than this, causing the JobTracker to queue map tasks as they wait for open slots. If you have a 20 node cluster with a total of 100 slots, your queue will become quite large and your process will take a long time. And don't forget, your job is likely not the only job competing for cluster resources.


If instead of 10,000 10MB files you had 800 128 MB files you would only need 800 map tasks. This would require an order of magnitude less JVM maintenance time and will result in better disk IO. Even though an individual map task processing 128 MB will take longer than a map task processing 10 MB, the sum of all of the processing time will almost always be orders of magnitude faster when processing the 800 larger files.


What can you do if you have small files?

Now that we have discussed what constitutes a small file and why Hadoop prefers larger files, how do you avoid the small file problem? In my next post, I will discuss solutions to the NameNode memory problem as well as some initial options for solving the MapReduce performance problem. In my third and final blog in this series, I will discuss additional solutions for the performance problem and how to choose the best solution for your situation.

This post was written by Dave Reinke and originally published on Wednesday, June 22, 2016


In a previous blog, we discussed the importance of tuning data lookups within Pentaho Data Integration (PDI) transformations.  We suggested that there were identifiable design patterns which can be exploited to improve the performance and stability of our code.  In this blog, we will investigate the most frequently used lookup pattern:  Key-based, Single Record Lookup.


This pattern’s classic use case involves an inbound stream of fact records queried from a source containing dimensional attributes in the form of business keys.  In order to load this data into the target fact table, the dimensional business keys must be converted into a set of technical keys referencing target dimension tables. The conversion of dimensional business to technical keys is done through a series of lookups.


The Database Lookup and Stream Lookup steps are used to implement this pattern.   Database Join could be used, but would be much less efficient since it does not support caching.   Additionally, we are not considering PDI’s purpose built Dimension Lookup/Update and Combination Lookup/Update steps as this pattern assumes no versioning of dimension data, i.e. no slowly-changing dimensions and that we have already loaded our dimension tables prior to the fact table load.  (I can already hear some of my colleagues’ arguments.   Yes, I know that Dimension Lookup/Update can be used for Type 1 dimensions and that its use enables the code to more easily migrate to Type 2, versioned, dimensions.  It is a candidate for this pattern, but requires specific database design changes for its use.   We’ll discuss Dimension Lookup/Update in a future blog.  I promise.)  So with that caveat, let’s examine the use of Database Lookup.

Using Database Lookup for Key-based, Single Record Lookups

Database Lookup is the prime choice for this pattern because it is easy to set up and provides rich functionality and tuning options.  However, there are two vital prerequisites for its use.  First, the lookup source must be a single database table.  Second, the lookup table must contain a unique key column, often a business key, and a technical key column, almost always the primary key. The following diagram shows an example of this step’s configuration.  It depicts a lookup of the technical product_key from the dim_product table in my_datamart database using a unique product_code business key field.


As configured, the step will generate and issue the following SQL statement to the my_datamart database:

select product_key from dim_product where product_code = ?


The ? is replaced by the value in the inbound stream’s product_code field.  


Since the step is configured to enable caching of lookup data, the SQL statement will be run only once for each distinct product_code value found within the inbound stream.  Additionally, we provide a default value of 0 should the lookup fail to find a match.  No-Match behavior and caching are both configurable.  Let’s discuss...

Handling No-Match Lookup Failure in the Database Lookup Step

In our example, a no-match would occur if an inbound stream record’s product_code did not find a matching product_code in the dim_product database table.   The step is configured to “not fail”, rather to place a default value of 0 in the outbound stream’s product_key field.  The underlying assumption is that there exists a  “null” or “unknown” dim_product database record with a 0 product_key for null or invalid product_code’s.  The following table describes the two ways a “no-match” can be handled by this step.



Step Configuration

Processing Implications

Provide a Default Value

Depicted in the example.

A subsequent step in the transformation can be used to filter and/or operate on the defaulted value, perhaps writing an error record and sending an email to an administrator.

Filter the inbound record

Check the "Do not pass the row if the lookup fails" box

No error is raised and the row is not passed to the outbound stream. This prevents you from taking corrective action on the errant data but ensures that only valid matches are processed.

In practice, we rarely use the filter option as most data programs attempt to remediate and/or alert if invalid data appears.


Optimizing the Lookup Cache for the Database Lookup Step

Not implementing or optimizing the lookup cache is a frequent mistake made by new PDI developers. Fortunately, it’s simple to explain and set up!  The following table details the configuration options:

Cache Option

Step Configuration

Processing Implications

Cache All

Check “Enable Cache”, then check “Load all data from table”

A single SQL select statement is executed, retrieving all key and lookup column data from the database Lookup Table. The entire Lookup Table dataset must fit in PDI’s memory. No inbound stream records will be processed until this entire dataset is retrieved and cached.

Cache Some

Check “Enable Cache”, uncheck “Load all data from table”, then specify how big (in # of lookup records) the cache will be allowed to get via the “Cache Size in rows” field.

One SQL select statement will be executed per distinct inbound record key that is not already found in the cache. PDI memory must be able to hold the number of lookup rows specified.

No Cache

Uncheck “Enable Cache”

Every inbound record will cause a SQL select statement to be executed against the Lookup Table -- even if keys are repeated!


So, when to use which option?   Here are some insights into our best practice decision-making...

Cache Option

When to Use

Cache All

  • The lookup table is not very big

Cache Some

  • The lookup table is very big, the inbound stream is not small and contains repeated lookup keys

  • The lookup table is big and the inbound stream is sorted on lookup keys (i.e. removing previously cached records from the active cache has no downstream penalty)

No Cache

  • The lookup table is very big and the inbound stream is very small

  • The lookup table is very big and the inbound stream contains very few (or no) repeated lookup keys


The most common configuration is Cache All. In general, if Cache All works efficiently, then use it. Most lookup tables are small (< 100,000 records) and thus can be retrieved quickly and will easily fit in memory. If full lookup dataset retrieval is inordinately slow or memory is limited, then consider Cache Some or No Cache depending on the size and lookup key distribution in your inbound stream.

Using Stream Lookup for Key-based, Single Record Lookups

Stream Lookup differs from Database Lookup in the following manner:

  • It’s lookup dataset is sourced from a second inbound stream of records and not tied directly to a single database table.

  • It always caches all lookup data -- there is no option for partial or no caching. 

Thus, Stream Lookup should be used in the following circumstances:

  • The lookup data is not sourced from the entirety of a single lookup table

    • You may want to use a subset of rows in a lookup table. In this case, you’d use a Table Input step to gather the subset and then stream that subset as the lookup data into the Stream Lookup step, or

    • the lookup data is sourced by joining two or more tables or streams.

  • You have multiple lookups against the same “not big” lookup table

    • You have role-playing dimension keys that relate a single fact record to the same dimension table multiple times. In this case, you’d use a Table Input step to query all of the lookup data and then use “copy” hops to stream that data to multiple Stream Lookup steps -- one per role.

  • You have a source of data that does not reside in a database table

    • E.g. a flat file, or a spreadsheet


Since the lookup data stream for Stream Lookup can be generated by any valid sequence of PDI steps, the alternatives and examples for its use are somewhat boundless.


Also note that Stream Lookup does not provide a filter record option for “no-match” conditions.   If you want to rid your stream of “no-match” records you have to use a Filter step after the Stream Lookup.


As you can see, there is tremendous opportunity to tune (or mistune!) lookups when doing a simple key-based, single record lookup.   Next in our series, I’ll examine a similar pattern but with a subtle twist:  Looking up the Most Recent Record.

This post was written by Dave Reinke and originally published on Wednesday, July 6, 2016


As we continue our series of Pentaho Data Integration (PDI) Lookup Patterns, we next discuss best practice options for looking up the “most recent record”. Common use cases for this pattern include finding the most recent order for a customer, the last interaction of a web site visitor, and the last claim record for a policy. As you’d expect there are multiple ways in which this can be done within PDI.


We’ll explore two of the most frequently used options, but first let’s define an example use case.

Example Use Case

Suppose we have a stream of customers each identified with a customer_key and there is an orders table which contains the following columns:





Primary key of the order. The column we want to lookup.


Foreign key to customers. The column we need to match with our input stream records..


The timestamp of the order.  We’ll use this to determine the most recent record for a given customer_key.

Other columns

There may be many other columns which we could include in our lookup.   We will ignore these in order to de-clutter our example.


If you had customer_key, 100, and wanted to get the most recent order_key, you would issue a query like the following:


select order_key

from orders

where customer_key = 100

order by order_timestamp desc

limit 1


The database would return exactly one order_key (limit 1) for the given customer_key (100) chosen as the first of a list sorted by the order_timestamp descending. In PDI, we can use the Database Lookup and Database Join steps to accomplish the same result.

Approach 1: Database Lookup

As previously discussed, the Database Lookup step is most often used for key-based, single record lookups, however it is also apt for the Most Recent Record lookup pattern. The following picture demonstrates how this step would be configured for our example.



The step looks up the order_key of the most recent record in the order_db.orders table for each input stream record’s customer_key. The Order by field tells PDI to sort the matching orders records by order_timestamp in descending order. By design, the step will only return one row -- the first row returned after the sort.   So, we have effectively configured the step to execute a query similar to the one listed above.


As configured, our example will execute a single lookup query for each input stream record. If customer_key’s are repeated in the input stream, this will result in many redundant query executions. To eliminate these redundant queries, caching can be configured for Database Lookup. Database Lookup enables caching because it restricts the lookup query to a single database table. The performance boost can be quite significant. In our example, suppose there are 1,000,000 input stream records containing 55,000 distinct customer_key’s. Without caching, the step will execute 1,000,000 lookup  queries. If we check the Enable Cache? box, the step will only fire 55,000 lookup queries -- one per distinct customer_key. This results in an order of magnitude or more in improved performance. 

Approach 2: Database Join

The Database Join step can also be configured to implement the Most Recent Record lookup pattern. The following picture shows it’s configuration for our orders example.



When using the Database Join step, you must supply your own lookup SQL. In our simple example, this is a straightforward parameterized select statement that returns a order_keys for a given customer_key in descending timestamp order -- most recent record first. We enter 1 in the Number of rows to return field to ensure that only the first record for each input stream record is passed to the output stream. We check Outer join? to allow a failed lookup record (i.e. no matching orders for a customer_key) to pass with a null order_key in it’s output stream record.


Performance-wise, the Database Join step will take about as long to process as a non-cached Database Lookup step. There will be one query per input stream record. Unlike Database Lookup, there is no ability to cache data to improve performance. However, because Database Join allows you to define your own SQL Select statement, you have the ability to define arbitrarily complex lookup queries.

Which Approach to Use?

So, how do you decide which step to choose? The answer is rather straightforward. Use Database Lookup step if you have a single lookup table which a straightforward lookup condition. Leverage caching if your input datastream contains duplicate lookup keys. Only use Database Join if your lookup query is complex. Remarkably, in the field, we see developers using to Database Join without realizing that Database Lookup can be leveraged to improve performance. Don’t make that mistake!


One final comment, the savvy reader will note that this same pattern applies to “least recent record” or historically first record lookups as well. In fact, any arbitrary record sort that you might apply can be handled by this design pattern.

Next in our series, I’ll examine a the Master-Detail explosion pattern causing the Database Join step to come to the fore. Visit our website in the next week or so, or follow us on Twitter @inquidia or Facebook to tune in to the rest of this series.

This post was written by Chris Deptula and originally published on Wednesday, January 28, 2015


With an immutable file system and no update command, how do you perform updates in Hadoop?   This problem occurs in just about every Hadoop project.   Yes, the most common use case of Hadoop data ingestion is to append new sets of event-based and/or sub-transactional data.   However, for analytics, you often have to join these events with reference data (hereafter, I will refer to as “dimension data”) -- data that can and usually does change over time.


Imagine you are collecting website activities through an intelligent event tracking system.  The events are generated as JSON datums.  You’ll likely have some automated collection mechanism that uses streaming or micro-batching ingestion into HDFS.  Simple enough, just append the new data.   Now your business users want to analyze these events by various web user attributes -- attributes that you maintain in your registration systems.   You often update registration attributes via third party data append services and from direct updates by the registrants.  You need a way to maintain a current (perhaps historically versioned) copy of the registrant data in HDFS.  You need updates in Hadoop.


Performing data updates in Hadoop is hard.  On a recent project and after a long day of Pig scripting, one of my colleagues said, “My head hurts.  I have to think about things I never had to think about when using databases.”   His experience maintaining dimension data using mature ETL tools and RDBMS technology was great, but now that had to translate to the data programming capabilities in Hadoop.   Comparatively speaking, data programming in Hadoop is, how do we put this nicely, still maturing.


If you want to implement “upsert” (i.e. merge) logic or (gasp) Type 2 dimension logic, you have to build it yourself.   HDFS files are immutable, meaning you cannot edit a file after it has been written. Hive very recently added, and Pig does not (yet) provide an update command, much less a merge statement. Some day they will, but for now you must develop your own algorithms.   Fortunately, there are some patterns which can be applied.


In this blog, we’ll discuss four common approaches:  Hybrid Update, HBase Dimension Store, Merge and Compact Update, and “Good Enough” Update.  There is a fifth less common approach involving the creation of a custom access layer that resolves updates at read time.  This is often reserved extremely high volume data where the batch processing cost to transform all of the data does not offset the cost of transforming the data on-demand.  We won’t cover this last approach.


Hybrid Update

The Hybrid Update approach uses mature ETL and SQL programming to maintain dimensional data within an RDBMS then periodically copies the data, likely using Sqoop, into HDFS replacing the previous version.  All of the hard work to handle special update rules, slowly changing dimensions (i.e. versioning), surrogate key assignment, etc. can be maintained using highly productive ETL programming languages.   The Hadoop “update” becomes a file replacement.   Given that most companies have already invested in a data warehouse to maintain dimensional data, this approach is easiest to implement and often the first adopted.


However, there are downsides.  First, the approach is not scalable.  Suppose you have a billion record dimension table.  With this approach you’d extract a billion records from your RDBMS each time you need to update the dimension data in Hadoop.  There would clearly be performance and capacity implications!  Even if you currently have manageably sized dimensions, as data volume grows, your load performance will degrade. 


Second, this approach requires orchestration between processes running in different environments.   You have to make sure that your RDBMS ETL completes before your extract process starts.   The latency requirements for your Hadoop-based analysis are wholly dependent upon the timeliness and success of your RDBMS load processing.  You could use an ETL technology like Pentaho Data Integration to orchestrate, but your load windows may shrink and exception handling could be complex.


In the end, if all of your dimensions are relatively small with slow growth, then this approach may be a good fit.

Otherwise, we’d only recommend it as a transitional step before embarking on one of the following Hadoop-centric algorithms.


HBase Dimension Store

HBase is a NoSQL database that runs on Hadoop.  Since it provides native support for updates, it is often considered for dimension data maintenance.  In fact, many ETL tools contain pre-built steps to stream inserts, update and deletes into HBase.  You also can ingest “real-time” streams of high velocity data into HBase, minimizing update latency.  And, HBase tables can be queried via SQL on Hadoop options like Hive and Impala with properly configured serdes.   All this makes HBase seductively attractive, but, be warned, there are significant performance challenges when using HBase tables for analytical queries.


Querying Hbase data works best when a single row or a range of rows is selected.   If the query must scan the entire table (as is often the case for analytical queries), HBase is quite inefficient.   In our labs, we set up a small experiment using Impala to contrast performance between a file-based dimension table versus an HBase version.  We created several aggregating “group by” queries that joined our dimension tables to a large, file-based fact table.  In all cases, the file-based dimension table outperformed the HBase version.


If you are 1) already using HBase, 2) comfortable maintaining and balancing its resource needs with the other services running on your Hadoop cluster and 3) able to ensure that your HBase queries always return a single row or a range of rows, then an HBase dimension store is a good solution to the update problem.  Otherwise, we’d recommend you consider one of the next two alternatives.


Merge and Compact Update

Unlike the Hybrid and HBase approaches this update strategy leverages an algorithm rather than a specific technology. The algorithm can be implemented with Java MapReduce, Pig, Hive, Spark, Cascading, Pentaho Visual MapReduce or any other Hadoop data programming language.  It consists of the following steps:


1. Maintain a Current Copy of the Master Data:  A complete copy of your dimension data must reside in HDFS.  For clarity’s sake we’ll call this the “master” data.  This requires a one-time, initial load of the data.  Check out the Inquidia Hadoop Datamart Importer (HDI) for an automated way to do this.


2. Load the Delta Data: Load the newly updated data into HDFS.  We’ll call this your “delta” data.


3. Merge the data:  Join the master and delta data together on the business key field(s).


4. Compact the data: After the merge you will have one or more records for each business key.  In the compaction, you apply logic that emits only one or more records per business key.  In the simplest case, you will emit one record having the maximum update timestamp.   However, if you need to maintain record versions (e.g. type 2 dimensions) then your logic will emit a set of versioned records with non-overlapping effectivity dates.  There may be additional complexity in your compaction process if you have to apply special business logic.  For example, you may never want to update some fields or only update a field when some condition occurs.


5. Write the Data to a Temporary Output: Since most Hadoop jobs (e.g. MapReduce) cannot overwrite an existing directory, you must write the compaction results to a temporary output.  The temporary output is your new dimension table containing a merge of new, updated and unchanged records.


6. Overwrite the Original Master Data: Move the temporary output to the location of the original master data, overwriting the previous version.


This Hortonworks blog illustrates a simple example of this approach using Hive.  It works when there are no complex business rules for the compaction.   If you have business rules that cannot be designed using SQL clauses, you may be limited in Hive.


The primary downside of this approach is that you must read and process the entire dimension table every time you perform an update -- even if only one record is changed.  If your dimension table is 2TB on disk, then you are reading and writing 2TB to perform the update.  Fortunately, there is a way around this issue...


“Good Enough” Update

If you have extremely large dimensions that cannot be processed within your batch window, then you either have to buy a bigger cluster or come up with another approach.  This is where the “good enough” update is applied.


At its heart is an intelligent partitioning strategy whereby you organize your dimension data so that you isolate records which are most likely to be updated into a small subset of the partitions.  This is often done using a record creation or last update date and assuming that recently inserted records are more likely to be updated than older records.  (For type 2 dimensions, you would also isolate the “current” version of each record since older versions are considered immutable.)  Within HDFS, you’d store each partition in a sub-folder of a parent dimension folder.   A Hive table can be defined to recognize these partitions, which might also improve query performance.


With this partitioning strategy in place, you simply perform the Merge and Compact Update on the subset of data considered most likely to change.   You avoid processing the entire dataset with the risk of inserting redundant dimension records.  The redundant dimension records are created when an updated delta record cannot find its previous version in the partition subset.


Of course, the potential to have more than one dimension record for the same dimension key means that query results which include that key might be wrong.   This may persist and potentially compound unless you augment this approach with a periodic merge and compact update on the complete dataset.   We have seen this strategy applied where good enough updates are executed throughout the week and then a full update is run over the weekend.  We have also seen organizations run data quality queries looking for duplicate keys and only running merge and compact jobs if an issue is detected.   In each case, there is a period of potential data inaccuracy which eventually is resolved.


The Good Enough Update approach should only be used when you cannot feasibly execute the full Merge and Compact Update and when there is a tolerance for small data errors.    Given the vagaries of event/sub-transactional data collection, most Hadoop-based analytical use cases already have such a tolerance.


Picking an Update Strategy

As you’d expect, one size does not fit all.   You will likely utilize more than one Hadoop update approach.   So, how do you decide which approach is right for which circumstance?    There are four major (and a whole host of minor) areas for consideration.   First, what is the size of your dimension? Are you dealing with 100’s, 1000’s, … 1,000,000’s of records?   Second, what is your latency requirement?  Do you have a batch window or do you need to see updates occur as changes “stream” in?   Third, how much data volatility exists?   Is this dimension fairly stable with new/changed records trickling in or do you see massive “delta” files with nearly every record being touched?   Fourth, how complex are your update rules?  Is it as simple as overriding the previous record with a delta record or are you implementing type 2 dimension logic with additional update rule dependencies?   The following table considers these factors for each algorithm.



In general, Hybrid is useful for small dimensions that have complex update rules.  HBase is best when latency is a major factor and update rules are simple.   Merge and Compact is the most versatile and often used but has scalability and latency limits.  Good Enough helps solve the Merge and Compact issues but at the cost of potentially introducing data quality issues.


The Future

Hortonworks has added batch insert, update, and delete capabilities in Hive as part of the initiative.  These capabilities are designed for large scale inserts, updates, and deletes as opposed to single record transactions.  Too many transactions over a period of time can lead to a drop in performance.  Additionally, ACID transactions in Hive currently only work with the ORC file format and is a very new immature feature.  As this capability matures we expect this to become an attractive option and widely used capability.


PUM18 has been a great success with around 100 participants from Austria, Germany and Switzerland!
Read the live blog covering all presentations




Save the date: Pentaho User Meeting 2018 for the DACH region (Germany, Austria, Switzerland) will take place on March 6, 2018!


Expect a full day of talks, networking and exchanging ideas starting at 10am and ending at 6pm.


German Federal Police, CERN and Netfonds will share their use cases with us and Pedro, Jens and Matt will present hot stuff from Pentaho 8.0.


Call for Papers

We are looking for users to present their projects! Send your proposal to Stefan at

Talks are 30 long including answers and questions, languages are German and English.



Agenda and more details can be found on the event page


Looking forward to see you in Frankfurt on March 6!

"IoT Hello World"




It was brought to my attention at PCM2017 in November, that I still owe the third blog on using the MQTT plugin for Pentaho Data Integration (PDI). So, thanks for reminding me. For those that haven’t been keeping track, there were two other blogs written earlier this year introducing the MQTT plugin at Pentaho + MQTT = IoT and applying some security techniques to MQTT in Securing IoT Data with Pentaho’s MQTT Plugin. So, to close out this 3 blog series, let’s kick it up a bit. I would like to describe a small project I’ve been working on that uses PDI and MQTT in a bi-directional communication configuration. This means using both the MQTT Publisher and MQTT Subscriber steps within the same transformation. I have tweeted snippets of this project throughout the year as I developed it, and now I’ll explain how it works. Refer to the diagram in Diagram-1 which shows the full architecture of this project, which I call “IoT Hello World”. I use this project for demonstrations, education and as a way to generate streaming IoT data for participants in the #hitachicode1 hack-a-thon events.



Diagram-1: The overall architecture of project "Pentaho - IoT Hello World"


Project Overview


A quick description of what this project does before I explain how it works. This is a 6 degree of freedom (DoF) robotic arm that performs several robotic routines - Displaying "Hello" & "World", then picking up and placing a car on the track (refer to the video above). While looping through these routines, components of the robotic arm, the servo motors, will begin to heat up. Temperature sensors have been attached to all 6 servo motors so that the temperature of the servo motors can be sensed and reported. The data stream is published in comma separated variables (csv) format, that is generated from this data includes,


  • Robotic Arm Controller (RAC) serial number (read from a configuration file)
  • A RAC system timestamp
  • The "Shoulder" servo motor temperature
  • The "Twist" servo motor temperature
  • The "Wrist" servo motor temperature
  • The "Elbow" servo motor temperature
  • The "Base" servo motor temperature
  • The "Claw" servo motor temperature
  • An MD5 hash of first 8 fields of this message
  • The MQTT Topic that this message is published under


An example MQTT message looks like this and is published under the topic "Pen/Robot/Temp",


SN84112-2,2017-07-26 13:57:28,26.0,25.0,25.0,27.0,24.0,24.0,6c18c1515dea2c5ddfcc6c69a18cbedf,Pen/Robot/Temp


There is a another message that is published by the RAC during startup. When the RAC is booted and the "Start" button is pushed, a registration message is published. This message "registers" the RAC with the Corporate Server Application which is stored in the device registration table. The message consists of the following fields in csv format,


  • RAC serial number (read from a configuration file)
  • A RAC description (read from a configuration file)
  • A RAC system timestamp
  • IP address of the RAC
  • The node name of the RAC
  • The PDI software version
  • Latitude
  • Longitude


An example of the device registration MQTT message looks like this and is published under the topic "Pen/Robot/DeviceRegister",


SN84112-2,SN84112-2 Robot Arm Hello World 6 DOF 6 Servos,2017-07-26 13:33:50.138,,ironman,,30.6291065,-100.995013667


Again, refer to Diagram-1. The four main sections of this project, with the fourth component being optional, of this project are,


  1. The “Corporate Server Application” subscribes and publishes to several message queues that,
    1. Subscribes to the "Device Registration" message queue to receive a one time registration information message when the RAC comes online
    2. Subscribes to the "Device Data Stream" message queue for the operational information from the RAC
    3. Publishes the response to the "Corporate Response" messages queue about corporate operating temperature specifications
    4. Publishes a message stream to the "Mobile Stream" message queue for a mobile application to monitor the robot arm remotely
  2. The Robot Controller Application and Sensor Monitoring which registers the RAC with the Corporate Server, collects sensor data and other controller information, and manages all the control buttons, LED indicators and the robotic arm itself.
  3. The MQTT Broker, which is a free MQTT Broker at (but any MQTT Broker can be used) for hosting the four message queues used in this system. The four message queues used are defined as,
    1. Device Data Stream – used for the device data stream
    2. Device Registration – used for the device registration message
    3. Corp Response – used for the corporate server app to send messages back to the robot arm controller.
    4. Mobile Stream – a dedicated message stream for the mobile device app from the corporate server app
  4. A fourth component is a mobile app for remote monitoring. This is not required and the system will run with or without the mobile device present.


As mentioned earlier, what’s happening with this system is while the robot arm is running its routines, the servo motors will begin to heat up. The sensors attached to each servo motor are read continuously and a status LED associated with the temperature readings provide a local visual status. This is called the “Local Vendor Temperature Status” and it covers a temperature range that corresponds to Green (OK), Yellow (WARNING) and Red (OVERHEAT) LEDs. The other set of status LEDs are called the “Corporate Temperature Specifications”. The Corporate Server Application is subscribing to the RAC’s data stream in real-time and responding based on a different definition of Green (OK), Yellow (WARNING) and Red (OVERHEAT) conditions. Basically, the corporate temperature specifications are lower than the vendor temperature specifications.


I called this the “fishing boat situation”. If you've ever ask a fishing boat captain if the boat will go faster, they will respond with “...yes, the vessel is designed to go faster, but by running the engines at half speed, the engines will last twice as long”. This is the same situation here. The robot arm vendor will say that the robot can run at a higher temperature, but the owner of the robot arm (Corporation ABC) wants to operate the robot arm at a lower temperature, or in this case, indicate that the robot arm is operating at a higher temperature than they want it to. This, of course, is just a simulated scenario in order to tell the story of what’s going on.



The Robot Arm Controller


The robot, robotic arm controller (RAC) and monitoring application is a collection of orchestrated PDI transformations that execute through a job that all run on a Raspberry Pi. These PDI components do the following,


  1. Registers the robot with the current GPS location using a GPS module installed on the RPi, various system information, the device’s serial number and device description. This registration information is formatted and published to the “Device Registration” message queue.
  2. The Data Collection component monitors the 6 RobotArmUI.pngtemperature sensors installed on the RAC's servo motors and are connected the RPi. This component publishes the formatted message to the “Device Data Stream” message queue.
  3. Subscribes to the “Corp Response” message queue to receive the current corporate operating temperature specifications.
  4. PDI also orchestrates a suite of python programs that are used to do the physical IO associated with the RAC like,
    1. Turn LEDs indicators on and off
    2. Collect temperature readings from the temperature sensors
    3. Interface with the control buttons
    4. Collect the latitude and longitude data from the GPS module



All the physical general purpose IO (GPIO) connections for LEDs, sensors, GPS coordinates, and robotic arm manipulation and control are performed through various python programs executed from PDI. An example python program looks like this for reading the 6 temperature sensors and outputting a list of temperatures. I picked this code sample to show you because it reads all the sensors in parallel versus sequentially. Each sensor takes several seconds to read and the time adds up if read in sequence.



The Corporate Server


The Corporate Server plays the role of maintaining the company’s operational specification on preferred temperature levels versus the vendor's operating specification. Let’s take a closer look at the transformation running the Corporate Server. In the transformation shown below, see Transformation-1, we can see that this transformation uses both the MQTT Subscriber and MQTT Publisher step simultaneously within the same transformation.


From left to right, the transformation subscribes to the topic “Pen/Robot/Temp” at the MQTT Broker. We then copies the message stream into four threads.


  1. This thread goes to another transformation that builds another message and publishes it to the "Mobile Device" message queue.
  2. This thread splits the message into fields and updates a table in the database (the commit for this database is set to 1 so every record is immediately committed).
  3. This thread also gets a copy of the message and splits out the temperature values and does some calculations on the temperatures from the RAC's temperature sensors. This value is then used in the Switch/Case step to compare it to a table of corporate define temperature values. Base on the result, a MQTT message is published to the “Corp Response” queue on the MQTT Broker with a Green (OK), Yellow (WARNING) or Red (OVERHEAT) status. This status is then used by the RAC to light the corporate LED indicators. This is what I’m referring to as bi-directional communication or full-duplex communication using MQTT. The transformation is subscribing to a topic and receiving a message, then within the same transformation publishing a message.
  4. This thread is just dumping the raw messages straight into a log file locally for prosperity purposes.



Transformation-1: The main transformation for the Corporate Server Application.


There is a second (floating) transformation within this main transformation for catching the Device Registration information when the RAC is started. This should work when there are multiple controllers using this system. In fact, the mobile device stream is a join of the Device Registration table (to get the GPS latitude and longitude, the robot arm controller description information, and the information from the table storing the temperature readings already stored in the database.


That's it! The entire process on the RAC is set up when the Raspberry Pi boots up. The push button interface kicks of scripts that launches kitchen to start the RAC application. The Corporate Server Application is started with a script to start pan.


RACandPhone.pngThis is a great project that has great IoT demonstration value, plus it was just a blast to do. It is also portable so I can travel with it as necessary. I have started on a second version of this robot concept with many more sensors, more complicated routines and other updated components. I looking forward to showing that project some day soon.


Let me know if you have any questions or comments. I know this project is less industrial and more hobby-ish in nature, but it is a great tool for demonstrating complex concepts with Pentaho and generating real IoT data streams.

Back when I took the Johns Hopkins Data Science track on Coursera, one of my homework assignments for the Developing Data Products course was to create a dynamic tool using R and Shiny that would graphically demonstrate the Central Limit Theorem (CLT).  The CLT says that if you take the means of groups of random numbers, then the means will be normally distributed no matter what the underlying distribution of random numbers looks like.  The finished assignment included a histogram that reacted dynamically to user input including the number of random numbers to generate, how many per group, and so on.  The Shiny application would detect user input and dynamically update the histograms in R.


A couple of weeks ago, I wrote about implementing histograms using PDI and CTools: Creating CDE Histograms using PDI and a User Defined Java Class.  If we want a CTools histogram to be dynamic in the sense of my R/Shiny homework assignment, then we need to send parameters from the Ctools dashboard to the kettle transformation.  Fortunately this is easy to do.  The following example was produced using the Pentaho BA Server 7.1, and the complete source is attached.  To see it work, simply upload the attached CentralLimitTheorem zip file to an empty folder in the BA Server.


The first step is to rework the histogram kettle transform example slightly from last week's example.  To demonstrate the CLT, we'll use only the exponentially distributed random numbers, and so we'll remove the rest.  Next, we'll want to add  group structure to the random number generation.  We can do this as shown in Figure 1 by adding an incrementing row id using a Sequence step, adding a group number (= row id modulo number of rows in a group) using a Calculator step, and finally emitting a group average using the Group By step.  Also we use a Get Variables step and an Add Constants step to define histogram dimensions from user input instead of a Data Grid.

Figure 1: Taking average over groups of rows using Sequence, Calculator, and Group By steps.


Next, we want to create a dashboard that uses this transformation.  The basics of how to do this are described in my blog post fromlast week, but briefly, the new kettle transformation is uploaded through the Pentaho User Console, a new CDE dashboard is created, and a kettleTransFromFile datasource is created pointing to the new kettle transform.  The major addition here is that in the current example, we want CTools to pass user input back to the kettleTransform and update the histogram in response.  To do this, we need

  1. to configure the datasource to accept parameters and map them to kettle transformation input parameters,
  2. to create user changeable parameters in the dashboard,
  3. to have the chart to update when the user parameters change

There are several places where parameters have to be defined, and these are demonstrated in the following.


Kettle Datasource Configuration with Parameters


Lets consider the kettleTransFromFile datasource.  In the example, the transformation is called t_Central_Limit_Theorem.ktr and the output step with the histogram is "out: 1".  IMPORTANT: Make sure Cache is set to false, since the chart is to be interactive.  The datasource here is describing a CDA datasource with its own parameters and defaults, and these need to be mapped to the kettle transformation input parameters.  This is done by clicking on the "Variables" item in the datasource properties (See Figure 2).  The entries on the left under the "Arg" title are the CDA parameters.  These are the parameters that that the chart components use when they pull data down from a datasource, and the entries on the left under the "Value" title are the corresponding kettle input parameters defined in the transformation.

Figure 2: The datasource Variables dialog in CDE.


If an entry appears in the Variables table, then a value can be passed from the dashboard to the kettle transform.  If an entry does not appear in this table, then the dashboard cannot see that kettle input parameter, and the default value defined in the kettle transform will be used.


Next, the data type and default value need to be specified for each CDA parameter.  This is done by clicking on the "Parameters" item in the datasource configuration list (See Figure 3).

Figure 3: The datasource Parameters dialog in CDE.


Note that these are "CDA" parameters and defaults - you must use the same name here that you used in the "Arg" column of the Variables configuration table. And the default value given here is what will be passed to the CDQ query (and ultimately to the kettle transformation) if the dashboard does not specify it.


Creating Dashboard Parameters and Inputs


In our example, we want the user to be able to change each parameter in the CDA.  They are going to be used by multiple components, including input text boxes and one or more CCC charting components, so these parameters live in the dashboard.  In the components tab, add a parameter for each and name them accordingly.


Figure 4: Parameters defined in the Components tab.


In order to allow the user to see and change the values of these parameters, some kind of input control must be chosen.  For the example, I'm using simple Text Input Component (text box) inputs, but other input controls like dropdown lists, date ranges and buttons are available as well. Each input should be mapped to a parameter via the properties of the input (see Figure 5), and these will come from the pool of dashboard parameters.  In fact, if you start typing the Parameter name, the CDE will pull up a context menu of prompts drawn from the list of dashboard parameters that have been created so far.


Figure 5: Mapping parameters to Text Input Components.



Adding Charting Components


This is where it gets really interesting.  We're going to work with a bar chart again since we're making histograms.  In order for the histogram to change in response to user input, we have to do two things: we need tell the chart component to react to changes in a dashboard parameter and we need to tell the chart component how to update from its datasource.  That's done in the Listeners and Parameters dialogs respectively of the charting component configuration shown in Figure 6.

Figure 6: The configuration section for a Community Bar Chart component.


To have the chart react to parameter changes, click on the Listeners dialog.  A list of available dashboard parameters will be presented with checkboxes.  Simply check the parameters of interest.  If a parameter is not checked as a listener, then the component will not update when that parameter changes.


What does it mean for a chart to react to a parameter change?  It means that it refreshes its data from its datasource, and so it needs to know how to map the dashboard parameters to the cda parameters in its datasource.  This is done by clicking on the Parameters dialog, and shown in Figure 7.

Figure 7: The Parameters dialog box of a Community Chart Component.


When updating in reaction to a parameter change, the chart component will use the values of the parameters on the right for the CDA datasource arguments on the left.  Note: if a parameter is not listed here, then CDA will use the default value from figure 3.  Also, it is not necessary to use a dashboard parameter here.  A constant value could be used instead if, for instance, the chart has some default value that is not shared by other chart components.


Tying it all Together


So when we look at the final product, we have the following result shown in figure 8.  Back to the Central Limit Theorem, the width of the resulting normal distribution of means is inversely related to the square root of the number of entries in each group, so it is useful to be able to change the histogram definition to track changes in the other input parameters.  Note that the user can also see the underlying exponential distribution by simply setting the group size to 1.  Lots of interactive investigations are possible with this, the whole power of PDI is behind it!


In closing, a note about the parameter mappings we encountered above.  It should be noted that CTools is a collection of distinct technologies: CDA (Community Data Access) which gets data from backend sources like a database or a PDI and returns it in a usable format, CCC (Community Charting Components) which displays graphical information, and the CDF (Community Dashboard Framework) which provides the backbone that knits components together.  The parameter mappings fundamentally exist at the boundaries of these technologies and reflect the immense flexibility of CTools.  The ability to define mappings and defaults at every level makes a rich variety of possible dashboard configurations. For example, to enforce a common parameter value across all dashboard components, simply specify it's default in CDA and leave it out of the component mappings.  Or to change the user experience of the current example and have the chart update in response to a button change rather than react to each parameter change individually, simply add a button and make the histogram listen only to the button.


UPDATED 12/11/2017: The layout of the attached example site (Central Limit was updated using some best practices that I learned about at the CT1000: CTools Fundamentals online course last week.