Greg Graham

Dynamically Exploring the Central Limit Theorem with CDE

Blog post created by Greg Graham on Dec 2, 2017

Back when I took the Johns Hopkins Data Science track on Coursera, one of my homework assignments for the Developing Data Products course was to create a dynamic tool using R and Shiny that would graphically demonstrate the Central Limit Theorem (CLT). The CLT says that if you take the means of groups of independent random numbers, then the means will be approximately normally distributed once the groups are large enough, no matter what the underlying distribution of random numbers looks like (provided it has a finite variance). The finished assignment included a histogram that reacted dynamically to user input, including the number of random numbers to generate, how many per group, and so on. The Shiny application would detect user input and dynamically update the histograms in R.
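
For readers who want to see the effect outside of R or Pentaho, here is a minimal Python sketch. It is not the original Shiny assignment. It draws exponentially distributed numbers, averages them in groups, and reports the skewness of the group means, which should move toward 0 (the value for a normal distribution) as the group size grows. The function names, the seed, the rate of 1.0, and the sample sizes are all illustrative assumptions.

```python
import random
import statistics


def group_means(num_groups, group_size, rate=1.0, seed=42):
    """Return the mean of each group of exponentially distributed draws."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(group_size))
        for _ in range(num_groups)
    ]


def skewness(values):
    """Sample skewness; close to 0 for a symmetric distribution such as the normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


if __name__ == "__main__":
    for size in (1, 5, 50):
        means = group_means(num_groups=20000, group_size=size)
        # Print group size, average of the group means, and their skewness.
        print(size, round(statistics.fmean(means), 3), round(skewness(means), 3))
```

With a group size of 1 the "means" are just the raw exponential draws, which are strongly skewed. For the exponential distribution the skewness of a mean of n draws is 2 divided by the square root of n, so larger groups give a visibly more symmetric histogram.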

 

A couple of weeks ago, I wrote about implementing histograms using PDI and CTools: Creating CDE Histograms using PDI and a User Defined Java Class. If we want a CTools histogram to be dynamic in the sense of my R/Shiny homework assignment, then we need to send parameters from the CTools dashboard to the kettle transformation. Fortunately, this is easy to do. The following example was produced using the Pentaho BA Server 7.1, and the complete source is attached. To see it work, simply upload the attached CentralLimitTheorem zip file to an empty folder in the BA Server.

 

The first step is to slightly rework the histogram kettle transform from last week's example. To demonstrate the CLT, we'll use only the exponentially distributed random numbers, and so we'll remove the rest. Next, we'll want to add group structure to the random number generation. We can do this as shown in Figure 1 by adding an incrementing row id using a Sequence step, adding a group number (= row id modulo number of rows in a group) using a Calculator step, and finally emitting a group average using the Group By step. We also use a Get Variables step and an Add Constants step to define histogram dimensions from user input instead of a Data Grid.

Figure 1: Taking average over groups of rows using Sequence, Calculator, and Group By steps.
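
The same Sequence, Calculator, and Group By logic can be sketched in a few lines of Python. This is only an illustration of the idea and not an export of the attached transformation; the function name and the `divisor` argument are hypothetical. One thing the sketch makes visible: taking the row id modulo a divisor d deals the rows round-robin into d groups, so each group holds roughly (number of rows) / d entries.

```python
from collections import defaultdict


def group_averages(values, divisor):
    """Mimic Sequence -> Calculator -> Group By on a list of row values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Sequence step: attach an incrementing row id starting at 1.
    for row_id, value in enumerate(values, start=1):
        # Calculator step: group number = row id modulo the divisor.
        group = row_id % divisor
        # Group By step: accumulate what is needed for a per-group average.
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sorted(sums)}


if __name__ == "__main__":
    # Six rows dealt into three groups of two rows each.
    print(group_averages([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
```

Because the random numbers are independent, it does not matter for the CLT which rows land in which group, only how many rows each group contains.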

 

Next, we want to create a dashboard that uses this transformation. The basics of how to do this are described in my blog post from last week, but briefly, the new kettle transformation is uploaded through the Pentaho User Console, a new CDE dashboard is created, and a kettleTransFromFile datasource is created pointing to the new kettle transform. The major addition here is that in the current example, we want CTools to pass user input back to the kettle transform and update the histogram in response. To do this, we need

  1. to configure the datasource to accept parameters and map them to kettle transformation input parameters,
  2. to create user changeable parameters in the dashboard,
  3. to have the chart update when the user parameters change.

There are several places where parameters have to be defined, and these are described in the sections that follow.

 

Kettle Datasource Configuration with Parameters

 

Let's consider the kettleTransFromFile datasource. In the example, the transformation is called t_Central_Limit_Theorem.ktr and the output step with the histogram is "out: 1". IMPORTANT: Make sure Cache is set to false, since the chart is to be interactive. The datasource here describes a CDA datasource with its own parameters and defaults, and these need to be mapped to the kettle transformation input parameters. This is done by clicking on the "Variables" item in the datasource properties (see Figure 2). The entries on the left under the "Arg" title are the CDA parameters; these are the parameters that the chart components use when they pull data down from a datasource. The entries on the right under the "Value" title are the corresponding kettle input parameters defined in the transformation.

Figure 2: The datasource Variables dialog in CDE.

 

If an entry appears in the Variables table, then a value can be passed from the dashboard to the kettle transform.  If an entry does not appear in this table, then the dashboard cannot see that kettle input parameter, and the default value defined in the kettle transform will be used.

 

Next, the data type and default value need to be specified for each CDA parameter. This is done by clicking on the "Parameters" item in the datasource configuration list (see Figure 3).

Figure 3: The datasource Parameters dialog in CDE.

 

Note that these are "CDA" parameters and defaults: you must use the same name here that you used in the "Arg" column of the Variables configuration table. The default value given here is what will be passed to the CDA query (and ultimately to the kettle transformation) if the dashboard does not specify it.
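
The fallback rules described in this section can be summarized with a small Python model. This is a toy sketch of the behavior as described above, not CDA code; the function and all parameter names in the example (`groupSize`, `GROUP_SIZE`, `NUM_ROWS`) are hypothetical and are not taken from the attached files.

```python
def resolve_kettle_params(dashboard_values, cda_defaults, variable_map, kettle_defaults):
    """Return the values the transformation would run with.

    variable_map maps a CDA parameter name (the "Arg" column)
    to a kettle parameter name (the "Value" column).
    """
    # Kettle parameters that are not mapped keep the defaults from the transformation.
    resolved = dict(kettle_defaults)
    for cda_name, kettle_name in variable_map.items():
        # A value sent by the dashboard wins; otherwise the CDA default is passed on.
        resolved[kettle_name] = dashboard_values.get(cda_name, cda_defaults[cda_name])
    return resolved


if __name__ == "__main__":
    print(resolve_kettle_params(
        dashboard_values={},                               # the dashboard sends nothing
        cda_defaults={"groupSize": "10"},                  # Parameters dialog
        variable_map={"groupSize": "GROUP_SIZE"},          # Variables dialog
        kettle_defaults={"GROUP_SIZE": "5", "NUM_ROWS": "1000"},
    ))
```

In this example the mapped parameter takes the CDA default of 10 rather than the kettle default of 5, while the unmapped NUM_ROWS keeps the value defined in the transformation.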

 

Creating Dashboard Parameters and Inputs

 

In our example, we want the user to be able to change each parameter in the CDA. These parameters are going to be used by multiple components, including input text boxes and one or more CCC charting components, so they live in the dashboard. In the Components tab, add a parameter for each and name them accordingly.

 

Figure 4: Parameters defined in the Components tab.

 

In order to allow the user to see and change the values of these parameters, some kind of input control must be chosen. For the example, I'm using simple Text Input Component (text box) inputs, but other input controls like dropdown lists, date ranges and buttons are available as well. Each input should be mapped to a parameter via the properties of the input (see Figure 5), and these will come from the pool of dashboard parameters. In fact, if you start typing the Parameter name, CDE will pull up a list of suggestions drawn from the dashboard parameters that have been created so far.

 

Figure 5: Mapping parameters to Text Input Components.

 

 

Adding Charting Components

 

This is where it gets really interesting. We're going to work with a bar chart again since we're making histograms. In order for the histogram to change in response to user input, we have to do two things: we need to tell the chart component to react to changes in a dashboard parameter, and we need to tell the chart component how to update from its datasource. That's done in the Listeners and Parameters dialogs, respectively, of the charting component configuration shown in Figure 6.

Figure 6: The configuration section for a Community Bar Chart component.

 

To have the chart react to parameter changes, click on the Listeners dialog.  A list of available dashboard parameters will be presented with checkboxes.  Simply check the parameters of interest.  If a parameter is not checked as a listener, then the component will not update when that parameter changes.

 

What does it mean for a chart to react to a parameter change? It means that it refreshes its data from its datasource, and so it needs to know how to map the dashboard parameters to the CDA parameters in its datasource. This is done by clicking on the Parameters dialog, as shown in Figure 7.

Figure 7: The Parameters dialog box of a Community Chart Component.

 

When updating in reaction to a parameter change, the chart component will use the values of the parameters on the right for the CDA datasource arguments on the left. Note: if a parameter is not listed here, then CDA will use the default value from Figure 3. Also, it is not necessary to use a dashboard parameter here. A constant value could be used instead if, for instance, the chart has some default value that is not shared by other chart components.
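
To make the interplay of Listeners and Parameters concrete, here is a toy Python model of a chart component. It only mirrors the behavior described in this section and is not how CDF is implemented; the class, its methods, and the parameter names are hypothetical.

```python
class ChartComponent:
    """Toy stand-in for a chart with a Listeners list and a Parameters mapping."""

    def __init__(self, listeners, parameters):
        self.listeners = set(listeners)  # dashboard parameters the chart reacts to
        self.parameters = parameters     # (CDA argument, dashboard parameter or constant)
        self.last_query = None           # arguments sent on the most recent refresh

    def on_change(self, changed_param, dashboard):
        """Refresh only if the changed parameter is one of the listeners."""
        if changed_param not in self.listeners:
            return False
        # Look up each right-hand entry as a dashboard parameter;
        # anything that is not a dashboard parameter is treated as a constant.
        self.last_query = {arg: dashboard.get(src, src) for arg, src in self.parameters}
        return True


if __name__ == "__main__":
    dashboard = {"groupSizeParam": "10", "numRowsParam": "1000"}
    chart = ChartComponent(
        listeners=["groupSizeParam"],
        parameters=[("groupSize", "groupSizeParam"), ("numBins", "25")],
    )
    print(chart.on_change("numRowsParam", dashboard))    # not a listener, no refresh
    print(chart.on_change("groupSizeParam", dashboard))  # listener, chart refreshes
    print(chart.last_query)
```

Any CDA argument left out of the mapping is simply not sent, which is the case where the datasource default applies.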

 

Tying it all Together

 

The final product is shown in Figure 8. Back to the Central Limit Theorem: the width of the resulting normal distribution of means is inversely proportional to the square root of the number of entries in each group, so it is useful to be able to change the histogram definition to track changes in the other input parameters. Note that the user can also see the underlying exponential distribution by simply setting the group size to 1. Lots of interactive investigations are possible with this, since the whole power of PDI is behind it!
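
As a rough guide for picking histogram limits, the standard deviation of a mean of n independent draws is the standard deviation of a single draw divided by the square root of n. The Python sketch below turns that into suggested bounds. It assumes an exponential distribution with a given rate (so the mean and the standard deviation of a single draw are both 1/rate) and an arbitrary margin of four standard deviations; the function and argument names are hypothetical and the dashboard itself does not do this calculation.

```python
import math


def histogram_bounds(group_size, rate=1.0, n_sigmas=4.0):
    """Suggest (low, high) histogram limits for means of exponential draws."""
    mean = 1.0 / rate
    # Spread of the group mean shrinks with the square root of the group size.
    sigma_of_mean = (1.0 / rate) / math.sqrt(group_size)
    low = max(0.0, mean - n_sigmas * sigma_of_mean)  # exponential values are never negative
    high = mean + n_sigmas * sigma_of_mean
    return low, high


if __name__ == "__main__":
    for size in (1, 16, 100):
        print(size, histogram_bounds(size))
```

Going from a group size of 16 to 100 narrows the suggested range considerably, which is why fixed histogram limits tend to leave either empty bins or clipped tails as the group size changes.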

 

In closing, a note about the parameter mappings we encountered above. CTools is a collection of distinct technologies: CDA (Community Data Access), which gets data from backend sources like a database or a PDI transformation and returns it in a usable format; CCC (Community Charting Components), which displays graphical information; and CDF (Community Dashboard Framework), which provides the backbone that knits components together. The parameter mappings fundamentally exist at the boundaries of these technologies and reflect the immense flexibility of CTools. The ability to define mappings and defaults at every level makes possible a rich variety of dashboard configurations. For example, to enforce a common parameter value across all dashboard components, simply specify its default in CDA and leave it out of the component mappings. Or, to change the user experience of the current example and have the chart update in response to a button click rather than react to each parameter change individually, simply add a button and make the histogram listen only to the button.

 

UPDATED 12/11/2017: The layout of the attached example site (Central Limit Theorem.zip) was updated using some best practices that I learned about at the CT1000: CTools Fundamentals online course last week.
