Histograms are a great way to probe density functions. Visually they look like ordinary bar charts, but the bars (or bins in histogram-speak) are always quantitative and ordered rather than qualitative. A histogram is defined by n mutually exclusive bins covering a continuous region of possible measure values from low to high. An instance of a histogram is specified by a value associated with each bin, plus one value for underflows and another for overflows. The value in each bin represents the number of times a measure value fell in that bin, the underflow represents the number of times a measure value fell below the low edge of the first bin, and the overflow represents the number of times a measure value fell above the high edge of the last bin. Observations may be weighted, in which case the values represent the sum of the weights.
In my last blog post, I demonstrated using a User Defined Java Class to measure the time between different workflow steps when a workflow is ill-defined or irregular. In this blog post, we'll explore creating a User Defined Java Class in a transformation, t_histogram_example.ktr, to create and fill histograms defined in an Info Step. Then we'll see how these can be used to drive a CCC Bar Chart in a dashboard created with the Community Dashboard Editor (CDE). In the current example, we'll use a data grid to bring in hard-coded histogram definitions, but these could come in from a configuration database as well.
Figure 1: In this example, the info step is a Data Grid providing hard-coded histogram definitions, each containing an Integer ID, a name, a low specification, a high specification, a number of bins, and an optional field name containing a fill weight for the histogram (default is 1.0).
The measure values that we will be looking at are random numbers generated according to several simple distributions which can be generated with inverse methods: uniform, Gaussian (or Normal), exponential, and logistic. (See http://luc.devroye.org/handbooksimulation1.pdf for more information on how these were generated.) The input parameters of the transformation define the ranges of these random numbers: a low and high for the uniform range, a mean and standard deviation for each Gaussian, and a scale factor for the exponential distribution. These measures come in as data rows.
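As a rough illustration of the inverse method, the sketch below maps a uniform random number u in [0, 1) to uniform, exponential, and logistic values by inverting each cumulative distribution function. The class and method names are hypothetical and are not taken from t_histogram_example.ktr. The Gaussian case is left out because its inverse CDF has no closed form and needs a numerical approximation.

```java
import java.util.Random;

// Hypothetical helper for illustration; not the code used in the transformation.
final class InverseSampling {
    // Uniform on [low, high): a linear map of u.
    static double uniform(double u, double low, double high) {
        return low + (high - low) * u;
    }

    // Exponential with the given scale (mean): inverts F(x) = 1 - exp(-x / scale).
    static double exponential(double u, double scale) {
        return -scale * Math.log(1.0 - u);
    }

    // Logistic with location mu and scale s: inverts F(x) = 1 / (1 + exp(-(x - mu) / s)).
    // Note: u == 0 yields negative infinity, so callers may want to redraw in that case.
    static double logistic(double u, double mu, double s) {
        return mu + s * Math.log(u / (1.0 - u));
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so runs are repeatable
        for (int i = 0; i < 3; i++) {
            double u = rng.nextDouble(); // u in [0, 1)
            System.out.printf("u=%.4f uniform=%.4f exponential=%.4f logistic=%.4f%n",
                    u, uniform(u, -1.0, 1.0), exponential(u, 2.0), logistic(u, 0.0, 1.0));
        }
    }
}
```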
During initialization of the User Defined Java Class, we get the row set containing histogram definitions using the method findInfoRowSet(). Next, we loop over the info step rows and create the histograms. The histograms are implemented in a Java static nested class that exposes the following methods:
- fill(x): Increments the value of the bin corresponding to x by 1.0.
- fill(x, w): Increments the value of the bin corresponding to x by the number w.
- getBins(): Returns all of the bin values in an array.
The name field identifies the measure in the data rows that is used to fill the corresponding histogram. If no weight field is specified, then the fill(x) method is used. Otherwise, the fill(x, w) method is used with the value w taken from the weight field.
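A minimal sketch of such a histogram class is shown here, assuming equal-width bins that are closed on the low edge and open on the high edge. The outer class name, the underflow and overflow accessors, and the handling of NaN are assumptions made for illustration; the class in the actual transformation may differ.

```java
// Hypothetical outer class; only fill(x), fill(x, w) and getBins() come from the post.
final class HistogramSketch {
    static class Histogram {
        private final double low;
        private final double high;
        private final double[] bins;
        private double underflow;
        private double overflow;

        Histogram(double low, double high, int nBins) {
            if (nBins <= 0 || !(high > low)) {
                throw new IllegalArgumentException("need nBins > 0 and high > low");
            }
            this.low = low;
            this.high = high;
            this.bins = new double[nBins];
        }

        // Unweighted fill: same as a weighted fill with w = 1.0.
        void fill(double x) {
            fill(x, 1.0);
        }

        // Weighted fill: adds w to the bin containing x, or to underflow/overflow.
        void fill(double x, double w) {
            if (Double.isNaN(x)) {
                return; // assumption: NaN measures are ignored
            }
            if (x < low) {
                underflow += w;
            } else if (x >= high) {
                overflow += w; // assumption: x == high counts as overflow
            } else {
                int i = (int) ((x - low) / (high - low) * bins.length);
                if (i >= bins.length) {
                    i = bins.length - 1; // guard against floating-point rounding
                }
                bins[i] += w;
            }
        }

        // Returns a copy so callers cannot modify the internal counts.
        double[] getBins() {
            return bins.clone();
        }

        double getUnderflow() { return underflow; }

        double getOverflow() { return overflow; }
    }

    public static void main(String[] args) {
        Histogram h = new Histogram(-5.5, 5.5, 11);
        h.fill(0.0);      // lands in the center bin (index 5)
        h.fill(1.2, 2.5); // weighted fill into bin index 6
        h.fill(-6.0);     // underflow
        System.out.println(java.util.Arrays.toString(h.getBins()));
    }
}
```

Having fill(x) delegate to fill(x, w) keeps the binning logic in one place, so the weighted and unweighted paths cannot drift apart.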
The User Defined Java Class accumulates data until no more data rows are detected. At this point it dumps all histograms to the output. Each row contains the histogram ID, the measure, the bin number, and the bin value. In the example, we split the output rows by histogram ID and send them to different dummy steps that can be referenced later.
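The dump described above can be sketched in plain Java as one output row per bin. The class and method names are hypothetical, and numbering bins from 0 is an assumption. In the actual step each row would be handed to PDI (typically through putRow) rather than collected in a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the code used in the transformation.
final class HistogramDump {
    // Builds one row per bin: {histogram ID, measure name, bin number, bin value}.
    // IDs and bin numbers are stored as Long because PDI's Integer type maps to java.lang.Long.
    static List<Object[]> toRows(long id, String measure, double[] bins) {
        List<Object[]> rows = new ArrayList<>();
        for (int i = 0; i < bins.length; i++) {
            rows.add(new Object[] { id, measure, (long) i, bins[i] });
        }
        return rows;
    }

    public static void main(String[] args) {
        // "gauss1" is a made-up measure name.
        for (Object[] row : toRows(2L, "gauss1", new double[] { 1.0, 4.0, 2.5 })) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```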
Figure 2: Output is split by histogram ID.
The output of the first Gaussian using the step preview is shown in Figure 3.
Figure 3: The output of a histogram containing a standard normal distribution with 11 bins from -5.5 to +5.5.
The following assumes that a Pentaho Business Analytics server is installed and running with the Community Dashboard Editor enabled. This example was done on version 7.1, and installation instructions can be found in the Pentaho 7.1 documentation.
To create a visual using the CDE, log into the Pentaho User Console (PUC) and create a new folder. In this example, the folder is called "/home/Histograms". Into this folder, upload the transformation t_histogram_example.ktr, and create a new CDE dashboard. In the data sources panel, create a new "kettleTransFromFile" datasource from the Kettle Queries tab. This type of datasource will silently create a Community Data Access (CDA) query, which will be the conduit between the PDI transformation and the CDE dashboard. To configure the datasource, specify:
- A unique name for the datasource
- The PDI transformation file having the histograms that was uploaded through the PUC
- The PDI step that contains the histogram of interest, in this case out: 2.
- Output Columns should be set to the bin number and the bin value. The CCC Bar Chart expects two input columns: the first should be the X-axis categories of the bar chart, and the second should be the bin values. The structure of the output can be seen from Figure 3, and counting from 0, these are the 3rd and 6th columns.
The remaining options can be left at their defaults.
Figure 4: Options needed for the kettleTransFromFile to set up a CCC Bar Chart.
In the layout panel, create a row and a column and give the column a name. Change the layout parameters as needed to create a reasonably large chart. Finally, in the components panel, choose a CCC Bar Chart component and connect it to the layout (using the name given to the column as the "htmlObject") and to the datasource (using the datasource name). Save the dashboard, and then preview it. If configured correctly, you should see a chart like the one shown in Figure 5.
Figure 5: The output of the histogram from step out: 2, which has a Standard Normal distribution.
In conclusion, we've demonstrated generating random data and binning it in histograms using a User Defined Java Class in PDI. We also showed how to display these histograms in a CCC Bar Chart in a CDE dashboard. The techniques demonstrated here can be extended to get histogram definitions from a configuration database managed by a web application, or even to output and clear histograms periodically to provide real-time displays.
Update 2017-01-05: When using an Info step with a User Defined Java Class that is set to start more than one copy, make sure the Data Movement is set to "Copy Data to Next Steps" on the Info step. You can control this setting by right-clicking on the Info step. By default the data movement is "Round Robin", so if the User Defined Java Class step is started with more than one copy, each copy receives only some of the rows from the Info step and the rest appear to be missing. Thanks to @Jeffrey_Hair for the explanation.
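To see why rows appear to go missing, here is a small plain-Java simulation of the two data movement modes. It is only an illustration with hypothetical names and does not call any PDI API: with round robin each step copy sees a subset of the histogram definitions, while copying gives every step copy the full set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation for illustration; not PDI code.
final class DataMovementSketch {
    // Round robin: row i is delivered only to copy (i mod copies).
    static List<List<String>> roundRobin(List<String> rows, int copies) {
        List<List<String>> received = emptyCopies(copies);
        for (int i = 0; i < rows.size(); i++) {
            received.get(i % copies).add(rows.get(i));
        }
        return received;
    }

    // Copy: every row is delivered to every copy.
    static List<List<String>> copyToAll(List<String> rows, int copies) {
        List<List<String>> received = emptyCopies(copies);
        for (List<String> copy : received) {
            copy.addAll(rows);
        }
        return received;
    }

    private static List<List<String>> emptyCopies(int copies) {
        List<List<String>> received = new ArrayList<>();
        for (int c = 0; c < copies; c++) {
            received.add(new ArrayList<>());
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> defs = List.of("hist1", "hist2", "hist3", "hist4");
        System.out.println("round robin: " + roundRobin(defs, 2));
        System.out.println("copy:        " + copyToAll(defs, 2));
    }
}
```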