
Eliminating Machine Learning Model Management Complexity

By Mark Hall and Ken Wood

 

MLMMDiagram3.png

 

Last year in 4-Steps to Machine Learning with Pentaho, we looked at how the Pentaho Data Integration (PDI) product provides the ideal platform for operationalizing machine learning pipelines – i.e. processes that, typically, ingest raw source data and take it all the way through a series of transformations that culminate in actionable predictions from predictive machine learning models. The enterprise-grade features in PDI provide a robust and maintainable way to encode tedious data preparation and feature engineering tasks that data scientists often write (and re-write) code for, accelerating the process of deploying machine learning processes and models.

 

 

“According to our research, two-thirds of organizations do not have an automated process to update their predictive analytics models seamlessly. As a result, less than one-quarter of machine learning models are updated daily, approximately one third are updated weekly, and just over half are updated monthly. Out of date models can create a significant risk to organizations.”

- David Menninger, SVP & Research Director, Ventana Research

 

 

It is well known that, once operationalized, machine learning models need to be updated periodically to take into account changes in the underlying distribution of the data they are being used to predict. That is, model predictions can become less accurate over time as the nature of the data changes. How frequently models get updated is application dependent, and can itself be dynamic. This necessitates an ability to automatically monitor the performance of models and, if necessary, swap the current best model for a better performing alternative. There should also be facilities for applying business rules that can trigger a re-build of all models, or manual intervention if performance drops dramatically across the board. These sorts of activities fall under the umbrella of what is referred to as model management. In the original diagram for the 4-Steps to Machine Learning with Pentaho blog, the last step was entitled “Update Models.” We can expand that original "Update Models" step to detail the underlying steps necessary to automatically manage the models, and relabel it "Machine Learning Model Management" (MLMM). The MLMM step comprises the 4-Steps to Machine Learning Model Management ("Monitor, Evaluate, Compare, and Rebuild all Models") that cover what we are describing here. The concept now looks like the diagram below.

 

MLMMDiagramOldNew3.png

 

 

The 4-Steps to Machine Learning Model Management, as highlighted, are Monitor, Evaluate, Compare and Rebuild. Each of these steps implements a phase of a "Champion / Challenger" strategy. In a Champion / Challenger strategy applied to machine learning, the idea is to compare two or more models against each other and promote the one that performs best. There can be only one Champion model (in our case, the model that is currently deployed) and one or more Challengers (in our case, other models that are trained differently, use different algorithms and so forth, but all run against the same dataset). The implementation of the Champion / Challenger strategy for MLMM goes like this:

 

  1. Monitor - constant monitoring of all of the models is needed to determine their prediction accuracy within the Champion / Challenger strategy. Detecting degraded model performance should actually be viewed as a positive signal for your business strategy: it means the characteristics of the underlying data have changed, often because the behaviors you were striving for are being achieved and external behaviors have shifted in response to your current model strategy. In the case of our retail fraud prediction scenario, the degradation of the current Champion model's performance is due to a change in the nature of the initial data. The predictions worked and are preventing further fraudulent transactions, so new fraud techniques are being used which the current Champion model wasn't trained to predict.
  2. Evaluate - an evaluation of the current Champion model needs to be performed to provide metrics of the model's current accuracy. This evaluation produces performance metrics on the current situation and can provide both visual and programmatic data for determining what is happening. Based on business rules, if the accuracy has dropped below a defined threshold, this event can trigger notifications of the degraded performance or initiate automated mechanisms. In our retail fraud prediction scenario, since the characteristics of the data have changed, the Champion model's accuracy has degraded. The evaluation metrics can be used to determine whether model retraining, tuning and/or a new algorithm is needed. Simultaneously, all models in the Champion / Challenger strategy can be evaluated against the same data to ensure an even comparison.
  3. Compare - by comparing the accuracy of all the models from the evaluation step, the Champion and the Challenger models can be ranked to determine which model performs best at this time. Since the most likely case is that the current Champion and all the Challenger models were built and trained against the initial state of the data, these models will need to be rebuilt.
  4. Rebuild - by rebuilding (retraining) all the models against the current state of the data, the best performing model on the current data is promoted to Champion. The new Champion can be hot-swapped and deployed or redeployed into the environment by using a PDI transformation to orchestrate this action.

 

This 4-Steps to Machine Learning Model Management is a continuous process, usually scheduled to run on a periodic basis. This blog describes how to implement a Champion / Challenger strategy using PDI as both the machine learning platform and the model management orchestration.

 

The new functionality that brings a set of supervised machine learning capabilities and the model management enablers to PDI is called Plug-in Machine Intelligence (PMI). PMI provides a suite of PDI steps that give direct access to various supervised machine learning algorithms, which can be designed directly into your PDI data flow transformations with no coding. Users can download the PMI plugin from the Hitachi Vantara Marketplace or directly from the Marketplace feature in PDI (automatic download and install). The motivation for PMI is:

 

  • To make machine learning easier to use by combining it with our data integration tool as a suite of easy-to-consume steps that do not require writing code, and by ensuring these steps guide the developer through their usage. These supervised machine learning steps work “out-of-the-box” by applying a number of “under-the-cover” pre-processing operations and algorithm-specific "last-mile data prep" to the incoming dataset. Default settings work well for many applications, and advanced settings are still available for the power user and data scientist.
  • To combine machine learning and data integration together in one tool/platform. This powerful coupling between machine learning and data integration allows the PMI steps to receive row data as seamlessly as any other step in PDI. No more jumping between multiple tools with inconsistent data passing methods, or complex and tedious performance evaluation manipulation.
  • To be extensible. PMI provides access to 12 supervised Classifiers and Regressors “out-of-the-box”. The majority of these 12 algorithms are available in each of the four underlying execution engines that PMI currently supports: WEKA, python scikit-learn, R MLR and Spark MLlib. New algorithms and execution engines can be easily added to the PMI framework with its dynamic step generation feature.

 

A more detailed introduction of the Plug-in Machine Intelligence plug-in can be found in this accompanying blog.

 

PMIList.png

PMI also provides a unified evaluation framework, that is, the ability to output a comprehensive set of performance evaluation metrics that can be used to facilitate model management. We call this unified because data shuffling, splitting and the computation of evaluation metrics are performed in the same way regardless of which underlying execution engine is used. Again, no coding is required, which in turn translates into significant savings in time and effort for the practitioner. Evaluation metrics computed by PMI include (for supervised learning): percent correct, root mean squared error (RMSE) and mean absolute error (MAE) of the class probability estimates in the case of classification problems, F-measure, and area under the ROC (AUC) and precision-recall (AUPRC) curves. Such metrics provide the input to model management mechanisms that can decide whether a given “challenger” model (maintained in parallel to the current “champion”) should be deployed, whether the champion and all challengers should be re-built on current historical data, or whether something fundamental has changed in the system and manual intervention is needed to investigate data processing problems or new models/parameter settings. It is this unified evaluation framework that enables PDI to do model management.

 

 

Implementing MLMM in PDI

The PDI transformations below are also included in the PMI plugin download complete with the sample datasets.

 

The following figure shows a PDI transformation for (re)building models and evaluating their performance on the retail fraud application introduced in the 4-Steps to Machine Learning with Pentaho blog. It also shows some of the evaluation metrics produced under a two-thirds training / one-third test split of the data. These stats can be easily visualized within PDI via the Data Exploration Tool (DET), or the transformation can be used as a data service for driving reports and dashboards in the Business Analytics (BA) server.

 

PDIRebuildModels.png

DETChartResults.png

 

The following figure shows a PDI transformation that implements champion/challenger monitoring of model performance. In this example, an evaluation metric of interest (area under the ROC curve) is computed for three static models: the current champion, and two challengers. Models are arranged on the file system such that the current champion always resides in one directory and challenger models in a separate directory. If the best challenger achieves a higher AUC score than the current champion, then it is copied to the champion directory. In this way, hot-swapping of models can be made on-the-fly in the environment.
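As a rough sketch of the promotion logic just described (the actual implementation is a PDI transformation built from standard steps, not Java code), the snippet below compares the champion's AUC against each challenger and hot-swaps the best challenger's model file into the champion directory; the directory layout and the evaluateAuc() helper are assumptions for illustration only.

import java.io.IOException;
import java.nio.file.*;

public class ChampionChallengerSwap {

    // Hypothetical helper: in the real transformation the AUC comes from the
    // PMI unified evaluation framework, not from this method.
    static double evaluateAuc(Path modelFile) {
        return 0.5; // placeholder
    }

    public static void main(String[] args) throws IOException {
        Path championDir = Paths.get("models/champion");       // assumed layout
        Path challengerDir = Paths.get("models/challengers");  // assumed layout

        Path champion = championDir.resolve("model.bin");
        double championAuc = evaluateAuc(champion);

        // Find the challenger with the highest AUC
        Path bestChallenger = null;
        double bestAuc = championAuc;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(challengerDir)) {
            for (Path challenger : stream) {
                double auc = evaluateAuc(challenger);
                if (auc > bestAuc) {
                    bestAuc = auc;
                    bestChallenger = challenger;
                }
            }
        }

        // Promote: copy the winning challenger over the current champion file
        if (bestChallenger != null) {
            Files.copy(bestChallenger, champion, StandardCopyOption.REPLACE_EXISTING);
            System.out.println("Promoted " + bestChallenger + " (AUC " + bestAuc + ")");
        } else {
            System.out.println("Champion retained (AUC " + championAuc + ")");
        }
    }
}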

 

ChampChallengerDiagram.png

 

PMI makes it very easy to build processes for model management. This, along with its no-coding access to heterogeneous algorithms, its automation of “last mile” algorithm-specific data transformations, and the enterprise-grade features in PDI (such as data blending, governance, lineage and versioning), results in a robust platform for addressing the needs of citizen data scientists and modern MI deployments.

 

Installation documentation for your specific platform, a developer's guide, and the sample transformations and datasets used in this blog can be found here. The sample transformations and sample datasets are for demonstration and educational purposes.

 

It is important to point out that this initiative is not formally supported by Hitachi Vantara, and there are no current plans on the Enterprise Edition roadmap to support PMI at this time. It is recommended that this experimental feature be used for testing only and not in production environments. PMI is supported by Hitachi Vantara Labs and the community. Hitachi Vantara Labs was created to formally test out new ideas, explore emerging technologies and, as much as possible, share our prototypes with the community and users through the Hitachi Vantara Marketplace. We like to refer to this as "providing early access to advanced capabilities". Our hope is that the community and users of these advanced capabilities will help us improve them and recommend additional use cases. Hitachi Vantara has forward-thinking customers and users, so we hope you will download, install and test this plugin. We would appreciate any and all comments, ideas and opinions.

Introduction

 

 

PRD is a powerful reporting tool. However, from my personal experience, some of the coolest functionality is hidden among all the available settings and combinations that PRD allows. Because of this, it is only through your own experience, and that of others, that you start to understand what you can do and how to implement it.

I would like to share with you the knowledge gathered during a project using the PRD tool.

 

This blog post is divided into sections, where features of particular PRD components are described.

The sections are listed below; clicking on one of them takes you to the corresponding section:

The tricks simply emphasise the details of certain component features.

Bear in mind that all the examples in this post use dummy data.

 

General Tricks

 

· How do you format a column retrieved by the query?

  • You need to have that field selected and then click on attributes, format

 

     general1.png

 

· How do you define a name for your sub-report?

  • Select your sub-report

 

general2.png

 

  • Go to the attributes and then name

 

general3.png

 

· There are a lot of functions that can be used in the reports:

  • Page - adds the page number; you need to put this function in the Page Footer
  • Row Banding – defines the color for the even and odd rows in a table

 

Query Tricks

 

· How many queries can be selected in your report?

  • You can only have one query selected per report/sub-report. This has an impact on how you structure your report.

 

· What happens when you change the query name?

  • Whenever you change the query name, the query that was previously selected becomes deselected, causing the report/sub-report to produce empty output because it doesn’t know where to fetch data from. The selected query layout should look something like:

 

query1.png

 

 

Parameter tricks

 

· How do you create a date parameter?

  • You need to keep in mind the formatting that you use. Taking this date as an example: 2017-01-01 00:00, you should set
    • Date Format - yyyy-MM-dd HH:mm. Year (yyyy), day (dd) and minutes (mm) use lowercase letters; month (MM) must be uppercase, otherwise it will be interpreted as minutes instead of a month.
    • Default Value - choose a default value to avoid typing the same value every time you want to generate a report.
    • Value Type - in this example it is a timestamp, but you can choose a date type as well

 

parameter1.png

 

· What are the formats used to display the parameter value?

  • On a Message Field component - $(nameOfTheParameter)
  • On a Query script - ${nameOfTheParameter}
  • On Functions - [nameOfTheParameter]

 

· How do you get the current date and format it in a Message Field Component?

  • Report.date – current date
  • $(Report.date, date, yyyy-MM-dd)

 

· How do you do a cascading filtering in your report?

See the following example:

 

parameter2.png

 

  • The Customer prompt will influence the arrayTypeParam prompt, and the latter will influence the arraySerialParam prompt
    • The arrayTypeParam prompt query needs to be filtered by the customer value, and the arraySerialParam prompt query needs to be filtered by the arrayTypeParam value
    • The parameters associated with each prompt need to be in the following order

 

parameter3.png

 

Having the parameters in the order in which you want the cascading to happen enables the cascading filtering that PRD gives you out of the box (OOB).

 

· How are the parameters passed through the reports?

 

  • To have access to the parameter values between the Master Report and the sub-reports, the Parameters option under Data in the sub-reports needs to have the following options selected

 

parameter4.png parameter5.png

 

 

Table tricks

 

· In which sections should a table be placed?

  • The columns/data retrieved by the query go in the Details Body
  • The labels, the headers of those columns, go in the Details Header
    • If you want to repeat the header on each page, you need to set the Repeat-header property to true in the Details Header.

The Details Body runs for each row of your query.

 

· Table Example

Imagine that you have a query with the following result set:

 

Port           Host Storage      Avg Read   Max Read
CL1-A, CL2-A   TOTWSCPPRDHDS03   5          20
CL1-A, CL2-A   TOTWSCTDEVRPT04   12         16
CL1-B, CL2-B   TOTVH10001        6          10
CL1-B, CL2-B   TOTVH10002        12         15

 

and you want to group the host storage information by Port, as shown in the image below:

 

table1.png

 

We created a sub-report, where we defined

      • in the Details Header, the port field
      • in the Details Body, the remaining fields: Host Storage Domain, LDEV
      • in the Page Header, the table headers

 

The Page Header section is repeated on every page. As this table is big and occupies more than one page, the table header will be repeated at the top of every page.

 

To iterate over the ports, we added the Port column to the Group section. This way, a new Details Header is shown every time a different port appears.

 

 

Generate Reports in excel tricks

 

· How do you set the sheet name?

  • To add a sheet name in an Excel file, you need to select the Report Header of your Master report or sub-report and then set the sheetname property

 

excel1.png

 

· How do you create different sheet names?

  • To create different sheets in your PRD report, you need to set pagebreak-before to true in the Report Header of each sub-report

 

excel2.png

 

· How do you set dynamically the sheet name?

    • Imagine that you have information for different sites in one single query, but you want to display the information for each site in its own sheet
      • You need to define a sub-report that
        • has a query that retrieves all the sites, with a layout like the siteQuery result below

 

site

EMEA

Korea

India

          • has the site column selected in the Group section. The Group will repeat the sections it contains (the Details section) for each iteration it makes, in this case for each site.

 

excel3.png

 

          • in the Details section, you need to add another (child) sub-report that will contain the information for each site
            • As sub-reports inherit the queries from their parent report, the site being iterated is passed to the child report
            • The query of the child report needs to be filtered by site
            • The Report Header needs to have
              • the pagebreak-before property set to true
              • the sheetname property set to something like

 

excel4.png

 

 

Charts tricks

 

· Which Java chart library do PRD charts use?

  • PRD charts are rendered with JFreeChart, which is why the chart scripting examples below import org.jfree classes.

 

· In which section, should a chart be added?

  • Contrary to what happens with tables, you need to add a chart not in the Details section but in the Report Header. Otherwise, you will have as many charts as the number of rows retrieved by your report/sub-report query.

· How do you configure a time series chart with two plots (Area and Line charts)?

In this case, the Used and Provisioned series are in one (primary) plot and the warning threshold in another (secondary) plot

  • Choose XY Area Line Chart

 

chart1.png

 

  • Choose Time Series Collector in both the Primary and Secondary DataSources
  • Your result set needs to look something like the table below; otherwise, it will only show the series with the greater value, in this case Provisioned.

 

Date_time Column (Date Column)   Label Column   Value Column   Threshold Label   Threshold Value
2017/01/01                       Used           5              warning           45
2017/01/01                       Provisioned    12             warning           45
2017/01/02                       Used           6              warning           45
2017/01/02                       Provisioned    12             warning           45
2017/01/03                       Used           7              warning           45
2017/01/03                       Provisioned    12             warning           45

 

chart2.png chart3.png

 

  • How do you format the x-axis tick labels? The date_time column is at day granularity, but our chart only shows months. So, looking at the images above, we need to set
    • time-period-type - Month
    • x-tick-period - Month
    • x-tick-fmt-str (the tick label format) – MMM YYYY
    • x-auto-range - True; otherwise we need to specify the x-min and x-max properties
    • x-tick-interval (the interval that you want between ticks) - 1 (we wanted to see one tick per month). It is essential to set this property, otherwise the format you specified will not be applied.
    • x-vtick-label - True (ticks will be displayed vertically instead of horizontally)

 

· How to hide the x and y gridlines and the x or y axis?

 

chart4.png

 

  • There is a scripting section that enables you to customise the chart further. You can use several languages, such as JavaScript or Java. The code below uses JavaScript:
    • chart - points to the chart itself
    • chart.getPlot() – grabs the chart's plot, which holds most of the chart properties
    • plot.getDomainAxis() - picks up the x (domain) axis
    • domain.setVisible(false) – hides the x axis
    • plot.setRangeGridlinesVisible(false) and plot.setDomainGridlinesVisible(false) – hide the gridlines for both axes

 

var plot = chart.getPlot();            // grab the plot from the chart
var domain = plot.getDomainAxis();     // the x (domain) axis

domain.setVisible(false);              // hide the x axis
plot.setRangeGridlinesVisible(false);  // hide the range-axis gridlines
plot.setDomainGridlinesVisible(false); // hide the domain-axis gridlines

 

· How do you dynamically set the x axis ticks of a time series chart depending on start and end date parameters?

  • Using the scripting section and choosing the Java language
    • Include the necessary libraries
    • int difInDays = (int) ((endDate.getTime() - startDate.getTime())/(1000*60*60*24)) - the difference between the endDate and startDate parameters, converted from milliseconds to days, gives the number of days we want to show. Passing that number to DateTickUnit unit = new DateTickUnit(DateTickUnit.HOUR, difInDays, new SimpleDateFormat("dd-MM-yyyy HH:mm")) tells the chart how many hours to put between ticks.
    • xAxis.setTickUnit(unit) - applies the tick unit that we just created

 

import org.jfree.chart.axis.ValueAxis;
import org.jfree.chart.axis.DateTickUnit;
import org.jfree.chart.plot.Plot;
import java.text.SimpleDateFormat;
import java.util.Date;

// Get the chart's plot and its domain (x) axis
Plot chartPlot = chart.getPlot();
ValueAxis xAxis = chartPlot.getDomainAxis();

// Read the start and end date report parameters
Date startDate = dataRow.get("startDate");
Date endDate = dataRow.get("endDate");

// Difference between the two dates in whole days
int difInDays = (int) ((endDate.getTime() - startDate.getTime()) / (1000 * 60 * 60 * 24));

// One tick every difInDays hours, labelled as dd-MM-yyyy HH:mm
DateTickUnit unit = new DateTickUnit(DateTickUnit.HOUR, difInDays, new SimpleDateFormat("dd-MM-yyyy HH:mm"));

xAxis.setTickUnit(unit);

 

· How to define dashed lines and specific colours in certain series in a line chart?

  • Using the scripting section, choosing the Java language, include the following code:
    • In the query, the series that I wanted to colour and draw with dashed lines were at the top (the first three series)
    • chartPlot.getSeriesCount() – returns the number of series in the plot (one per series in the query result set)
    • renderer.setSeriesPaint(i, Color.green) – sets the colour for series i
    • renderer.setSeriesStroke(i, new BasicStroke(1.0f, BasicStroke.CAP_BUTT, BasicStroke.JOIN_MITER, 10.0f, new float[] {2.0f}, 0.0f)) - sets the stroke: 1.0f is the line width, BasicStroke.CAP_BUTT is the cap style, and new float[] {2.0f} is the dash pattern that makes the line dashed
    • renderer.setSeriesShapesVisible(i, false) – disables the point shapes on every series
    • chartPlot.setRenderer(renderer) – applies the renderer configured in the scripting section to the chart

 

import java.awt.Color;
import java.awt.BasicStroke;

import org.jfree.chart.plot.XYPlot;
import org.jfree.chart.renderer.xy.XYLineAndShapeRenderer;

// Get the chart's XY plot
XYPlot chartPlot = chart.getXYPlot();

XYLineAndShapeRenderer renderer = new XYLineAndShapeRenderer();

for (int i = 0; i < chartPlot.getSeriesCount(); i++) {
    if (i < 3) {
        // Colour the first three series and draw them with a dashed stroke
        if (i < 1) {
            renderer.setSeriesPaint(i, Color.orange);
        } else if (i < 2) {
            renderer.setSeriesPaint(i, Color.red);
        } else {
            renderer.setSeriesPaint(i, Color.green);
        }
        renderer.setSeriesStroke(i, new BasicStroke(1.0f, BasicStroke.CAP_BUTT,
                BasicStroke.JOIN_MITER, 10.0f, new float[] {2.0f}, 0.0f));
    }
    // Hide the point shapes on every series
    renderer.setSeriesShapesVisible(i, false);
}

chartPlot.setRenderer(renderer);

 

 

Css tricks

 

· How can you stylise items, without using the style properties from the left panel?

      • It is possible to use an external stylesheet or the internal stylesheet - consult here how you specify the css style. However, there are some css properties that take precedence in relation to others. Imagine that you have the following example:
        • In the report-header, under a band element, there are two elements: text-field and a message.
        • The following Rules were defined:
          • report-header
            • font-size:16
            • background:blue
          • report-header text-field
            • font-size:16
            • background:pink
          • band text-field
            • font-size:10
            • background:purple
        • It was expected that
          • the text-field should have font-size:10 and background:purple, because the band text-field Rule is more specific than the report-header text-field Rule, as the band is inside the report-header
          • the message should have font-size:16 and background:blue, because the other Rules refer to text-field
        • The output was
          • the text-field had font-size:16 and background:pink, i.e. the report-header text-field Rule was applied
          • the message had the expected Rule applied
        • Even adding report-header to the band text-field Rule does not make any difference. The solution is not to mix band elements with other elements such as report-header, group-header, text-field. Use only bands, with the id and style-class properties under the Attributes panel. For instance, if you add a style-class reportHeader to the band and a style-class textField to the text-field:
          • band.reportHeader
            • font-size:16
            • background:blue
          • band.reportHeader .textField
            • font-size:10
            • background:purple
        • The expected result and the output are the same
          • the text-field has font-size:10 and background:purple
          • the message has font-size:16 and background:blue.

 

· When should you not use bands?

      • Imagine two sub-reports, one at the bottom of the other, and each one
        • is inside a band
        • contains a table

 

css1.png

 

      • Bands do not have the “overlap notion”; in other words, if you place one on top of the other, you won't get the red alarm rectangle warning that they are overlapping. So, if the first sub-report has a lot of rows to display, they will overlap with the rows returned by the second sub-report.
      • For this reason, do not use sub-reports inside bands.

This was originally published by Dave Reinke & Kevin Haas on Monday, February 22, 2016

 

Our previous blog on the Rise of Enterprise Analytics (EA) created quite a stir. Many readers had strong reactions both for and against our perspective. Several pointed comments about the continued importance of centralized, enterprise data repositories (lakes, warehouses, marts) gave us pause. To summarize: “How dare you even consider throwing away years of best practice data design and engineering and return us to an age of inconsistent spreadmarts and siloed Access databases. You should be ashamed!”

 

The critics will be heartened to learn we’re not advocating giving up entirely on IT-managed enterprise data. On the contrary, we believe the adoption of Enterprise Analytics mandates even more attention to and extension of best practice enterprise data management and engineering.

The Power of Analytic Producers Is Reforming How IT Manages Data

The subtle differentiation we’re making is between the data itself, and the analytics on that data. EA is about shifting analytic production toward those in the organization who drive innovation from data, i.e. the Analytic Producers. Analytic Producers are typically business analysts, data scientists and others responsible for measuring and forecasting performance and identifying new, data-driven products and opportunities.

 

Most of our recent projects have revolved around the enablement of Enterprise Analytics through a modern, extensible data architecture: one that relies on a foundation of governed dimensional data warehousing and modern big data lakes, while simultaneously enabling analysts to create and blend their own datasets. As Analytics Producers find value in adjunct datasets, that data is then integrated into what we call the “enterprise data ecosystem” (EDE). In contrast to the traditional EBI ecosystem, the vitality of the EDE is driven by business priorities and analytic innovations -- not the other way around.

 

 

ea-flow_0.png

 

The picture above depicts how the old EBI and new EA worlds integrate. The blue elements should look very familiar to EBI colleagues. This is the domain of stewarded enterprise data and standardized access mechanisms for “Analytics Consumers”. Most are probably familiar with the classic reporting, OLAP and dashboards provided by legacy BI vendors.

 

New Analytics Technologies Have Also Upset Traditional Data Semantic Governance

In addition to core EBI, we’ve added boxes for the increasingly prevalent statistical tools and libraries such as R, Python and Spark used by data scientists. Further, analytical apps built with R/Shiny, Qlik, Tableau, etc. provide tailored, managed access to data via highly visual and ubiquitous web and mobile interfaces. Indeed, Inquidia’s business is now more focused on analytical app dev than it is on dashboards and reports enabled via legacy commercial EBI tools. (More on this in an upcoming blog…)

 

The orange elements of the diagram depict new architectural elements driven by Enterprise Analytics clients. Ad-hoc data discovery and the ability to experiment with new data sources drives the need. Depending on the Analytics Producer, the new data sources range from simple spreadsheets to data scraped from the web -- and curated using agile programming languages like Python, R, Alteryx and even freely-available ETL software such as Pentaho Data Integration. Additionally, for some of our clients, we help create “data sandboxes” where Analytics Producers combine extracts (we often are asked to build) of enterprise data with their new, embellishing datasets for ease of blending.

 

A Modern Approach to Collaborative Enterprise Analytics Yields Benefits for Analysts and IT

Central to EA is the ability for Analytic Producers to share discoveries and collaborate. The Shared Analytics Results Repository provides this functionality. Many of our clients enable this sharing using Tableau server, though the same results could be attained through other low cost approaches including Tableau desktop with file sharing, internal wikis, Google Drive & Docs, etc. There’s certainly no need to reinvent collaboration technology.

 

Inevitably, a new “hot” analytic will emerge from EA initiatives -- one that is in demand by traditional Analytics Consumers. This is where expert enterprise data architecture and engineering is critical -- and often where data integration expertise plays a helping role. The gray boxes depict the escalation process with outputs detailing new data integration and semantic requirements. The orange “New Sources” box represents the extensibility of the data ecosystem. Depending on the nature of the data, it may land in the classic data warehouse or become part of the big data lake (e.g. Hadoop). The orange “Integrated User Models” box shows the extension of the enterprise semantic driven by the newly integrated data. These data may manifest in cubes, ad-hoc report metadata, or new analytical app requirements.

 

We hope this deeper dive into the nature of emerging Enterprise Analytics will allay fears of our colleagues that data architecture and engineering are no longer critical. The revolutionary concept of EA is not rampant decentralization of enterprise data, but rather an acknowledgement that for many business organizations (and perhaps yours), significant analytic expertise resides outside of IT. These analytics constituents must be serviced with more flexibility and agility for an enterprise that wishes to drive innovation through analytics.

This post was originally published by Dave Reinke & Kevin Haas on Wednesday, January 27, 2016

 

Is Enterprise BI dying? That’s the question our colleagues have been debating the past few months. We’re heavily wired into the business intelligence marketplace and have seen the nature of our projects change recently. Fewer clients are asking for classic ad-hoc query, reporting and analysis provided by enterprise BI platforms such as BusinessObjects, Cognos, Microstrategy and Pentaho.

 

Rather, clients are obsessed with providing data services to a growing potpourri of visual analytic tools and custom built analytic apps. More organizations expect data-driven, tangible evidence to support their decisions. The fundamental shift is from an IT mandated common data semantic via a monolithic BI platform to an assortment of “BYO” analytics technologies that depend on the creativity and self-reliance of business analysts and data scientists to meet enterprise analytical needs. Perhaps we are seeing the rise of a new analytics philosophy. Are we witnessing the Rise of Enterprise Analytics?

 

A Brief History of Enterprise BI

Enterprise BI platforms began life in the 1990’s when upstart disrupters like BusinessObjects and Cognos promised to “democratize data access” and introduce “BI to the masses.” For the most part, they delivered on these promises.

 

The core concept was a centralized, shared data semantic, enabling users to interact with data without requiring an understanding of the underlying database structure or writing their own SQL. Yes, SQL. All data for these platforms had to be queried from relational databases, preferably dimensionalized data warehouses that were designed and populated by IT.

 

The Enterprise BI platforms provided tremendous value to organizations that were starved for consistent data access. Once the underlying data was organized and a semantic defined, users were guaranteed conformed data access via ad-hoc and canned query, reporting and analysis modules. Additionally, complex reports and dashboards could be stitched together from common components. Nirvana...unless you were paying the license fees or wanted to switch to a new platform.

 

Excessive licensing fees and lock-in began the undoing of the monolithic BI platforms as open source technologies like Pentaho and Jaspersoft aggressively commoditized. However, even the open source options were still bottlenecked by a dependence on centralized IT to organize data and define a common semantic. Time for the next disruption…

 

The Trend is not Enterprise BI’s Friend: Five Trends Sparking the Rise of Enterprise Analytics

For context, consider how radically users' expectations of technology have changed since the Enterprise BI platforms took shape in the 1990s. We’ve identified five “megatrends” that are particularly relevant for analytics. First, technology has become high touch and amazingly intuitive. High touch as in actually touching, via tablets and phones. Apps and websites don’t come with binders or user manuals. You download and use, figuring it out along the way.

 

Second, technology is perpetually connected, enabling interaction with people and things anywhere. We expect to be able to do things at any time, any place and on any device. We change our home thermostat from across the country and speak with colleagues on the other side of the globe for little or no cost. Simply amazing if you stop to think about it.

 

Third, technology answers questions now. We’ve become impatient, no longer willing to wait even for the simple latency of an email exchange. Ubiquitous connectivity and Google are now taken for granted by a new generation of perpetually informed consumers.

 

Fourth, the increasing compute power in the hands of every business analyst is changing their problem solving approach. Data scientists can solve business problems by processing even more data with vastly more sophisticated algorithms than ever before. This has yielded technologies that visually depict these advanced analytics, resulting in greater experimentation, and an embrace of the scientific method.

 

Finally, technological sharing and collaboration is the new norm. Social networks have taught us that if we participate, then we will get more than we give. The open source software development model has spilled into just about every domain, harvesting the benefits of collaboration and improvement via derivative works. The trends empower and embolden the individual and stand in stark contrast to the command and control deployment inherent in classic Enterprise BI platforms.

 

Enter Enterprise Analytics

The legacy, centralized approach of Enterprise BI simply hasn’t recognized and responded to these trends.

 

Imagine an enterprise that leverages IT and engineering resources to provide a shared, secure data asset, but also fosters an ecosystem where analytics evolve through creativity, exploration, collaboration and sharing. An ecosystem where analytics take on a life of their own; where the “best-fit” analytics thrive and pass their “DNA” on to new generations built from an expanding data asset. Markets are won by data-driven organizations that learn faster and execute better than their competitors.

 

This is the vision for Enterprise Analytics.

 

As long time BI veterans, we were taught to root out siloed analytics, spreadmarts and the like. One of the commonly argued benefits for Enterprise BI platforms is that reported metrics are guaranteed to be consistent no matter how many users ask for them. There would be no more arguing over whose numbers are right. Rather, energy was to be spent interpreting and acting. We found this to be true with users rapidly absorbing the IT-managed metrics. However, just as quickly as we delivered the standardized, IT-governed metrics, users demanded new metrics requiring the rapid integration of increasingly varied and voluminous data. Few IT organizations could respond with the necessary agility, and innovation was stifled.

 

Adopting Enterprise Analytics changes the dynamic between user and IT. IT becomes an enabler, providing a shared and secure data infrastructure while users are empowered to create, share and improve analytics. For sure, the path is not straight. There are bumps along the way with arguments over whose metrics are more apt, etc. But the benefits of rapid innovation overpower the stagnation that comes from lack of analytical agility.

 

With a platform that enables collaboration, users are more apt to reuse and then extend metrics as they produce new analytics -- experimenting to improve an organization’s understanding. The costs of a little less governance are far outweighed by the benefits of rapidly improving actionable insight.

 

What's Next in Enterprise Analytics

Although the opportunity of Enterprise Analytics is staggering, Enterprise BI is not going to disappear overnight. We’ll still need pixel perfect and banded reports to satisfy regulations, official documents, operational norms and tailored communication. Privacy requirements still demand secured and managed access for wide swathes of enterprise data -- access that likely requires a stable, common semantic for which Enterprise BI platforms excel. And, we’ll increasingly see analytics delivered via “apps” with tightly scoped, but well-directed functionality to address a specific business process and targeted audience. Not everything will be 100% ad-hoc.

 

But, in our view, the reality of how business analysts and data scientists work, the tools they use, and the information they have access to is inciting a real change in the way that individuals are using information. Enterprise Analytics is at hand, and organizations that do not respond to this reality will find themselves increasingly irrelevant.

 

In future blogs, we’ll expand on the concepts introduced here, elaborating on the benefits of maintaining an Enterprise Analytics portfolio that consists of business-led Data Exploration, Data Science, Analytical Apps and governed data access and reporting. We’ll also discuss how to start and grow your own Enterprise Analytics ecosystem discussing technologies and techniques that work and the changed but still critical role of central IT. Along the way we’ll share insights and experiences as we enter this unquestionably exciting new age.

One of the most heavily discussed topics in machine learning and data mining today is sentiment analysis. For the uninitiated, sentiment analysis is the task of classifying text as positive or negative based only on previously classified text. In this article, I will attempt to classify the sentiment of Twitter comments about a certain movie, based only on a dataset of 10,662 movie reviews released in 2005. The solution will be demonstrated using two methods: first using only Pentaho Data Integration (with some R), and then with a more sophisticated model built using Weka.

Understanding the Naïve Bayes Classifier

Although many machine learning algorithms become very complex and difficult to understand very quickly, the Naïve Bayes classifier relies on one of the most fundamental rules in statistics, allowing its results to be highly interpretable, while also maintaining a high degree of predictive power. It is based upon Bayes’ Rule, which can be used to predict conditional probability. The equation reads:
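The equation appeared as an image in the original post; reconstructed in standard notation, Bayes' Rule is:

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]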

Applying Bayes’ Rule to sentiment analysis to classify a movie as bad given a specific review of “I hated it” would be:
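The applied equation was also shown as an image in the original; reconstructed, it reads:

\[
P(\text{negative} \mid \text{"I hated it"}) = \frac{P(\text{"I hated it"} \mid \text{negative})\, P(\text{negative})}{P(\text{"I hated it"})}
\]

and, once the words are treated as independent (the "naïve" assumption explained next), both the likelihood and the denominator factor word by word:

\[
P(\text{negative} \mid \text{"I hated it"}) = \frac{P(\text{"I"} \mid \text{negative})\, P(\text{"hated"} \mid \text{negative})\, P(\text{"it"} \mid \text{negative})\, P(\text{negative})}{P(\text{"I"})\, P(\text{"hated"})\, P(\text{"it"})}
\]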

The classifier is called “naïve” because we will assume that each word in the review is independent. This is probably an incorrect assumption, but it allows the equation to be simplified and solvable, while the results tend to hold their predictive power.
Applying Bayes’ Rule has allowed us to dramatically simplify our solution. To solve the above equation, the probability of each event will be calculated.
  • P("I"|negative) can be described as the total number of times “I” appears in negative reviews, divided by the total number of words in negative reviews
  • P(negative) is the total number of words that are in negative reviews divided by the total number of words in the training data
  • P("I") is the total number of times “I” occurs in all reviews divided by the total number of words in the training data

 

We can then compute the same equation with the occurrences of negative replaced by positive. Whichever probability is higher determines whether we predict the movie review's sentiment as negative or positive. The expectation would be that “hated” occurs significantly more often in the negative reviews, with the other terms being similar in both classes, thus allowing us to correctly classify this review as negative.

 

Build a Naïve Bayes Model using Pentaho Data Integration

To build the model in Pentaho, there are a few steps involved. First, we must prepare the data by cleaning it. Once this is done, we build the terms for each word in the classifier. Lastly, we test the performance of the model using cross-validation.

Step 1: Cleaning and Exploring the Source Data

To perform the sentiment analysis, we'll begin the process with two input files, one for negative reviews and one for positive reviews. Here is a sample of the negative reviews:

 

To clean the data for aggregation, punctuation is removed and words are made lowercase to allow for a table aggregated by class and word. Using the data explorer, we can start to see the word count differences for some descriptive words. These numbers intuitively make sense and help to build a strong classifier.

 

 

Step 2: Building Terms for the Classifier

Next, we build the various terms for the classifier. Using the calculator steps, we need the probabilities and conditional probabilities for each word that occurs either in a negative review or positive review (or both) in the training data. The output from these steps then creates the parameters for the model. These need to be saved, so eventually they can be used against testing data. Here is a sample:

 

It can be noted that some of these word counts are null (zero). In the training data, this only occurs if a word count is zero for one of the two classes, but in the test data it can occur for both classes of a given word. You will notice that the conditional probabilities for these null words are nonzero. This is because Add-1 smoothing is implemented: we “pretend” that the count is 1 when we calculate the classifier, preventing the calculation from zeroing out.
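As a small illustration of the idea (a hedged sketch; the blog's transformation computes these quantities with Calculator steps, and the exact smoothing variant it uses may differ), here is the standard add-1 (Laplace) form next to the simpler "pretend the count is 1" variant described above:

public class Add1Smoothing {

    // Standard Laplace (add-1) smoothing: add 1 to every word count and the
    // vocabulary size to the denominator so unseen words never zero the product.
    static double laplace(long wordCountInClass, long totalWordsInClass, long vocabularySize) {
        return (wordCountInClass + 1.0) / (totalWordsInClass + vocabularySize);
    }

    // Simpler variant closer to the blog's description: "pretend" a missing
    // word was seen once in that class.
    static double pretendOne(long wordCountInClass, long totalWordsInClass) {
        long count = (wordCountInClass == 0) ? 1 : wordCountInClass;
        return (double) count / totalWordsInClass;
    }

    public static void main(String[] args) {
        // Hypothetical counts for a word in the negative class
        System.out.println(laplace(250, 100_000, 20_000));   // prints ~0.00209
        System.out.println(pretendOne(0, 100_000));          // prints 1.0E-5
    }
}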

 

To calculate the classifier for a given instance of a review, as in the formula previously explained, we must apply the training parameters to the review—that is, match each word in the review being classified with its probability parameters and apply the formula. Note that when we solve the equation, we take the log of both sides because the terms being multiplied are very small.
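Written out (this is a reconstruction of the computation described above, not a formula shown in the original post), the log-space score compared for each class is:

\[
\log P(\text{negative} \mid w_1, \dots, w_n) \;\propto\; \log P(\text{negative}) + \sum_{i=1}^{n} \log P(w_i \mid \text{negative}) - \sum_{i=1}^{n} \log P(w_i)
\]

The last sum is the same for both classes, so it never changes which class scores higher; summing logs avoids the numerical underflow that multiplying many tiny probabilities would cause.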

 

Step 3: Model Accuracy using Cross-Validation

You will notice there is a Section #3 on the transformation to see how well our classifier did. It turns out that this is not the best way to check the accuracy. Instead, we will check the results using cross-validation. When building a model, it is important not to test the model against the training data alone. This causes overfitting, as the model is biased towards the instances it was built upon. Instead, using cross-validation, we re-build the model exactly as before, except only with a randomly sampled subset of the data (say, 75%). We then test the model against the remaining instances to see how well it did. A subset of the predictions from cross-validation, with 4 correct predictions and 1 incorrect prediction, can be seen here:

 

 

Ultimately, using cross-validation, the model made the correct prediction 88% of the time.

 

Test the Naïve Bayes Model on Tweets using Pentaho Data Integration and R

To read tweets using R, we make use of two R libraries, twitteR and ROAuth. A more detailed explanation of how to create a Twitter application can be found here.

 

This gives us a stream of tweets via the R Script Executor step in PDI. We will test the model using Jumanji: Welcome to the Jungle, the movie leading the box office on MLK Jr. Day weekend. Using the following code, we can search for recent tweets on a given subject. The twitteR package allows us to specify options such as ignoring retweets and using only tweets in English.

 

tweetStream = searchTwitter('Jumanji', lang='en', n=100)
dat = do.call("rbind", lapply(tweetStream, as.data.frame))
dat = dat[dat$isRetweet==FALSE,]
review = dat$text
Encoding(review) = "UTF-8"
review = iconv(review, "UTF-8", "UTF-8", sub='') ## remove any non-UTF characters
review = gsub("[\r\n;]", "", review)

 

Here is a sample of the incoming tweets:

 

 

Clearly, these tweets are not in the same format as the training data of old movie reviews. To overcome this, we can remove all @ mentions; most of these are unlikely to affect sentiment and are not present in the training data. We can also remove all special characters, which treats hashtags as regular words, and we remove all http links within a tweet. To keep only tweets that are likely to reveal sentiment, we will only test tweets with 5+ words.

 

To get predictions, we now follow the same process as before, joining the individual words of a tweet to the training parameters and solving the classifier. Here is a sample of the results, along with my own subjective classification:

 

Predicted Class | Subjective Class | Tweet
negative | positive | I have to admit this was a fun movie to watch jumanji jumanjiwelcometothejungle action httpstcopXCGbOgNGf
negative | negative | Jumanji 2 was trash Im warning you before you spend your money to go see it If you remember the first you wont httpstcoV4TfNPHGpC
negative | positive | @TheRock @ColinHanks Well the people who have not seen JUMANJI are just wrong so
positive | positive | Finally managed to watch Jumanji today Melampau tak kalau aku cakap it was the best movie I have ever watched in my life
negative | negative | Is Jumanji Welcome to the Jungle just another nostalgia ploy for money Probably httpstcoDrfOEyeEW2 httpstcoRsfv7Q5mnH
positive | positive | Saw Jumanji today with my bro such an amazing movie I really loved it cant wait to see more of your work @TheRock
positive | positive | Jumanji Welcome to the Jungle reigns over MLK weekend httpstcoOL3l6YyMmt httpstcoLjOzIa4rhD

 

One of the major issues with grabbing tweets based on a simple keyword is that many tweets do not reveal sentiment. Of the 51 tweets that were tested (the other 49 were either retweets or did not contain 5 words), I subjectively determined that only 22 of them contained sentiment. The successful classification rate for these tweets is 68%. This is significantly less than the success rate in cross-validation, but can be explained by the different use of language between the training set and the tweets: the slang, acronyms and pop-culture phrasing used on Twitter is not prevalent in the movie review training data from 2005.

 

Enhancing the Model with Weka:

The Naïve Bayes model can be greatly enhanced using Weka. Weka provides powerful features that can be applied through a simple interface and with fewer steps. Using its pre-built classifiers, the parameters can be easily tuned; here, Multinomial Naïve Bayes is used. First, the reviews are split into words, as required by Naïve Bayes, using the StringToWordVector filter. Additionally, 10-fold cross-validation is used. Instead of building the model once as we did before, the data is randomly partitioned into 10 sets; the model is run 10 times, leaving one set out each time, and the ten runs are combined to evaluate the classifier. This reduces overfitting, making the model more robust to the tweets.

 

Here is the output from the model:

 

 

When the tweets are scored using the PDI Weka scoring step, the subjective successful prediction rate increased slightly to 73%.

Pentaho Data Integration comprises a powerful, high-throughput framework for moving data through a network of transformation steps, turning data from one form into another. PDI is an excellent choice for data engineers because it provides an intuitive visual UI, so the engineer gets to focus on the interesting business details. PDI is also easily extendable.

 

PDI traditionally focuses on large scale re-aggregation of data for data warehousing, big data, and business intelligence, but there are lots of other interesting things you can contemplate doing with embedded PDI. Coming from a high frequency trading background, when I looked at the default list of transformation steps I was surprised not to see a generalized UDP packet processor. Data transmitted over UDP is everywhere in the trading world - market data often comes in UDP and so do custom prompt trading signals. So I ended up creating a toy implementation of two transformation steps, UDPPacketSender and UDPPacketReceiver, to send and receive data over UDP.

 

Part I of this blog post will introduce some concepts describing what UDP is and is not, an introduction to packet handling in Java, and how to write transformation steps for PDI. It is intended to be an introduction and is not exhaustive. Where applicable, links are provided for further reading. Code is provided separately in a GitHub repository for educational purposes only; no warranty is expressed or implied. Part II of this blog post will include some kettle transformations and demonstrations of the UDPSender and UDPReceiver in action.

 

What is UDP?

 

So what is UDP anyway, and why should we care about it? Information is generally sent over the internet using one of two protocols: UDP and TCP. UDP stands for User Datagram Protocol, and it is a lightweight, connectionless protocol that was introduced in 1980 by David Reed (RFC 768 - User Datagram Protocol) as an alternative to the comparatively high overhead, connection-oriented Transmission Control Protocol, or TCP. Among the properties of UDP:

  • Connectionless
  • Does not guarantee reliable transmission of packets
  • Does not guarantee packet arrival in order
  • Multicast/Broadcast

By contrast, TCP is connection oriented and has protocol mechanisms to guarantee packet transmission and ordered arrival of packets.  But it cannot do Multicast/Broadcast, and TCP has higher overhead associated with flow control and with maintaining connection state.  So UDP is a good candidate when latency is more important than TCP protocol guarantees, or when packet delivery guarantees can be ignored or engineered into an enterprise network or into the application level.

Much of the time, UDP packet loss occurs simply because intervening network devices are free to drop UDP packets in periods of high activity when switch and router buffers fill up.  (It's rarely due to an errant backhoe, although that does happen, and when it does it's a problem for TCP as well :-)

 

For many traditional UDP uses like VoIP, packet loss manifests as hiss and pop and is mitigated by the filtering and natural language processing capabilities of the human brain. For many other uses, like streaming measurement data, missing measurements can similarly be imputed. But, it is important to note, UDP packet loss may not be a problem at all on enterprise networks provided that the hardware is deliberately selected and controlled to always have enough buffer. It should also be noted that socket communication is an excellent mode of inter-process communication (IPC) for endpoints on different platforms on the same machine.

 

Finally, if packet loss is a problem, a lightweight mitigation protocol may be implemented in the application on top of UDP.  More information on the differences between UDP and TCP can be found here: User Datagram Protocol or Transmission Control Protocol.
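As a sketch of what such an application-level mitigation might look like (this is an illustration only, not part of the plugin described below), the sender can stamp each packet with a sequence number and the receiver can detect gaps:

import java.nio.ByteBuffer;

// Minimal sketch of an application-level loss check on top of UDP:
// the sender prepends a monotonically increasing sequence number and the
// receiver reports any gap it observes. Retransmission/recovery policy is
// left to the application.
public class SequenceCheck {

    private long expected = -1;

    // Sender side: prepend the sequence number to the payload.
    public static ByteBuffer wrap(long seq, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + payload.length);
        buf.putLong(seq).put(payload);
        buf.flip();
        return buf;
    }

    // Receiver side: returns the number of packets missed before this one.
    public long unwrap(ByteBuffer packet) {
        long seq = packet.getLong();           // remaining bytes are the payload
        long missed = (expected >= 0 && seq > expected) ? seq - expected : 0;
        expected = seq + 1;
        return missed;
    }
}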

 

Creating PDI Transformation Steps

 

There is excellent documentation on how to extend PDI by creating your own transformation steps at https://help.pentaho.com/Documentation/8.0/Developer_Center/PDI. The most relevant information for creating additional PDI plugins is located at the bottom of that page, under the heading "Extend Pentaho Data Integration".

 

In a nutshell, you'll need a working knowledge of Java and common Java tools like Eclipse and Maven. First we'll explore sending and receiving packets in Java and discuss threading models. Then we'll move on to the implementation of the four required PDI interfaces.

 

Sending and Receiving UDP Packets in Java

 

Since buffer overflow is usually the main culprit behind lost UDP packets, it is important to grab packets from the network stack as quickly as possible. If you leave them hanging around long enough, they will get bumped by some other packet. So a common pattern for handling UDP packets is to have a dedicated thread listening for packets on a port, reading the packets immediately and queuing them up for further processing on another thread. Depending on how fast packets are coming in and how bursty the incoming rate is, you may also need to adjust buffer size parameters in the software or even on the network device directly.

This implementation uses classes in the java.nio package. These classes are suited for high performance network programming. Two classes of interest are ByteBuffer and DatagramChannel. You can think of ByteBuffer as a byte array on steroids: it has get and put methods for each basic data type and an internal pointer so that data can be read and written very conveniently. ByteBuffer can also use native memory outside of the Java heap, via the allocateDirect() method. When such a buffer is passed to a DatagramChannel, received data can be written into that native memory without the extra step of first copying the data into the Java heap. For low latency applications this is very desirable, but it comes at the cost of allocation time and extra memory management. (For both of these reasons, it's a good idea to pre-allocate all of the ByteBuffers anticipated and use an ObjectPool.)
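A minimal sketch of this receive pattern is shown below. It is independent of the plugin code; the port, buffer size and handler are placeholders, and a production version would pool the buffers as noted above rather than allocating one per packet.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: a flagged listener thread drains packets from a DatagramChannel
// into direct ByteBuffers and hands them to a worker pool.
public class UdpListenerSketch implements Runnable {

    private final int port;                      // hypothetical configuration
    private volatile boolean running = true;     // stop flag checked by the loop
    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    public UdpListenerSketch(int port) { this.port = port; }

    public void stop() {
        // A real implementation would also need to unblock a pending receive
        // (e.g. by closing the channel) and hard-stop after a join timeout,
        // as described later in this post.
        running = false;
        workers.shutdown();
    }

    @Override
    public void run() {
        try (DatagramChannel channel = DatagramChannel.open()) {
            channel.bind(new InetSocketAddress(port));
            channel.configureBlocking(true);
            while (running) {
                // Direct buffer: received bytes land in native memory
                ByteBuffer buf = ByteBuffer.allocateDirect(1500);
                channel.receive(buf);              // blocks until a packet arrives
                buf.flip();
                workers.submit(() -> handle(buf)); // process off the listener thread
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void handle(ByteBuffer packet) {
        // Placeholder processing: in the PDI step this is where the packet
        // would be decoded into field values and passed to putRow().
        System.out.println("received " + packet.remaining() + " bytes");
    }
}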

 

Threading Model

 

A word of caution is in order here.  Multi-threaded application programming is challenging enough on its own, but when you are contemplating adding threads inside another multi-threaded application that you didn't write, you should be extra careful!  For example, PDI transformations are usually run within Spoon, the UI designer for PDI transformations.  Spoon has start and stop buttons, and it would be very bad for a threading issue to keep a transformation from stopping when the user hits the stop button, or to leak resources on multiple starts and stops.  Above all, don't do anything to affect the user experience.

If at all possible, when designing a transformation step that involves some threading, look at a similar example or two.  When creating this implementation, I looked at the pentaho-mqtt implementation on GitHub located at pentaho-labs/pentaho-mqtt-plugin.  This was very useful because I could look at how the mqtt-plugin developers married a network implementation capable of sending/receiving packets to PDI.  The model I ultimately settled upon has the following characteristics:

  • A startable/stoppable DatagramChannel listener with a single-use, flagged thread loop, and an "emergency" hard stop in case a clean join timeout limit was exceeded.
  • Assumes that the putRow() method is thread safe.
  • Uses an object pool of pre-allocated ByteBuffers to avoid inline object construction costs.
  • Uses a Java thread pool (ExecutorService) to both process packets and to serve as a packet queue.

Looking at the pentaho-mqtt-plugin and understanding the flow of data and the initialization sequences greatly sped up my own development efforts.

 

Implementation

 

A PDI transformation step implementation consists of four classes, one for each of the following interfaces:

  • StepMetaInterface: Holds the step's configuration and implements persistence to XML and the PDI repository.
  • StepDataInterface: Keeps processing state information.
  • StepInterface: Implements the processing logic of the step.
  • StepDialogInterface: Implements the configuration UI dialog for the step.

I'm not going to go into a lot of detail on the coding here, but the source code is provided for reference at the following GitHub repositories:

  • ggraham-412/ggutils v1.0.0 contains utility classes that implement logging that can be injected with alternative logging implementations at runtime, an object pool implementation, and UDP packet sender/receiver objects that wrap DatagramChannel and ByteBuffer objects together with a simple packet encoder/decoder.
  • ggraham-412/kettle-udp  v1.0.0 contains the implementations of the UDPReceiver and UDPSender steps

(Again, the code is provided for educational purposes only and no support or warranty is expressed or implied.)

The basic flow for UDPSender is simple.  We're expecting data to come into the UDPSender step from PDI.  So in the processRow() method (which is invoked by the framework on every row) we get a row of data, find the fields we're interested in sending, and send them with a UDPSender from the ggutils library.  If the row is the first one, we open the UDPSender ahead of processing.  If the row is null (end of data), we close the UDPSender and signal that the step is finished by returning false.
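A simplified sketch of that flow is shown below. The class name is hypothetical, the host and port are hard-coded where the real step reads them from its meta class, and a raw DatagramChannel stands in for the ggutils UDPSender; see the kettle-udp repository for the actual implementation.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.BaseStep;
import org.pentaho.di.trans.step.StepDataInterface;
import org.pentaho.di.trans.step.StepInterface;
import org.pentaho.di.trans.step.StepMeta;
import org.pentaho.di.trans.step.StepMetaInterface;

public class UdpSenderStepSketch extends BaseStep implements StepInterface {

  private DatagramChannel channel;

  public UdpSenderStepSketch( StepMeta stepMeta, StepDataInterface stepDataInterface,
      int copyNr, TransMeta transMeta, Trans trans ) {
    super( stepMeta, stepDataInterface, copyNr, transMeta, trans );
  }

  @Override
  public boolean processRow( StepMetaInterface smi, StepDataInterface sdi ) throws KettleException {
    Object[] row = getRow();                     // next incoming row from PDI

    try {
      if ( row == null ) {                       // end of data: close the socket, signal "finished"
        if ( channel != null ) {
          channel.close();
        }
        setOutputDone();
        return false;
      }

      if ( first ) {                             // open the sender on the first row only
        first = false;
        channel = DatagramChannel.open();
        channel.connect( new InetSocketAddress( "localhost", 4321 ) );
      }

      // Send the first field as UTF-8 text and pass the row downstream unchanged.
      String payload = getInputRowMeta().getString( row, 0 );
      channel.write( ByteBuffer.wrap( payload.getBytes( StandardCharsets.UTF_8 ) ) );
      putRow( getInputRowMeta(), row );
      return true;
    } catch ( java.io.IOException e ) {
      throw new KettleException( "UDP send failed", e );
    }
  }
}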

 

The flow in UDPReceiver is a little more complicated because, from the point of view of PDI, it is ingesting externally generated data.  Following the pentaho-mqtt-plugin model, the processRow() method is used only to determine a stop condition to signal to the framework.  Since there are no rows coming in, we use the init() method to start up a UDPReceiver from ggutils listening on the configured port.  The UDPReceiver has a handler method, so whenever a packet is received it is passed to the handler.  The handler unpacks the packet into a row and calls putRow().  The packets are handled on a thread pool, so fast-arriving packets may be processed out of order.  When the number of packets received hits a maximum, or when a time limit expires, processRow() returns false to the framework, stopping the transformation.  The UDPReceiver is stopped with a boolean flag, but just in case the thread does not stop for unknown reasons, it is hard-stopped after a join timeout to avoid leaking resources.

 

The settings UI dialogs are created using the Standard Widget Toolkit (SWT), which is a desktop UI library available for Java.  Note that there are many excellent custom widgets that already exist for use with Spoon, such as the table input widget.  To maintain the look and feel of PDI, it is best to work from an existing example dialog class from GitHub using these widgets.

 

Build and Deploy

 

In order to build the PDI transformation steps, import both the ggutils and kettle-udp projects above into Eclipse.  kettle-udp uses Maven, so make sure that the m2e Eclipse extension is installed.  Finally, if you're not familiar with building Pentaho code with Maven, you should visit the GitHub repository Pentaho OSS Parent Poms for important configuration help.  (Big hint: copy the settings.xml file from the maven-support-files folder in that repository into your ~username/.m2 folder so your build tools can find the appropriate Pentaho repositories.)

 

Once everything is building, export a jar file called UDPReceiver.jar.  In order to see the new steps in Spoon, copy the jar file to the folder data-integration/plugins/steps/UDPReceiver; the folder should have the same name as the jar file it contains.  You should see the UDPReceiver step in the Input category and the UDPSender step in the Output category the next time you start Spoon.

 

i18n and l10n

 

Pentaho contains facilities for internationalization and localization of strings and date formats.  The repository above contains en_US strings, but alternatives could easily be provided for the language/locale of your choice.  To provide localized strings in the application, simply include a messages package in your project structure (its name should match the target package plus ".messages") containing a messages_en_US.properties file.  The file should contain properties like "UDPSenderDialog.GeneralTab.Label=General", and the dialog code retrieves the string with a call to BaseMessages.getString(class, key), where class is a Java class in the target package and key is a string in the properties file.  So, for example, BaseMessages.getString( UDPSenderMeta.class, "UDPSenderDialog.GeneralTab.Label") will return the string "General" if the locale in Spoon is set to en_US.

Note: to troubleshoot this during development, make sure that the messages properties file is actually deployed in the jar file (inspect it with an archiver like 7-Zip) and that the properties files are included in the build CLASSPATH.  You may also have to switch the locale in Spoon away from, and back to, the desired locale to flush cached strings.

 

Conclusions

 

Data sent over the internet generally uses either the TCP or UDP protocol.  In this blog post, we looked at the characteristics of UDP and highlighted the basic techniques for handling UDP packets in Java.  Then we showed how those techniques can be incorporated into a PDI transformation step and provided a reference implementation (for educational purposes only).

 

In part II of this post, we will put these ideas into action with some demonstrations using Kettle transformations.

Joao Gameiro

Viz-SVG

Posted by Joao Gameiro Employee Dec 22, 2017

Introduction

 

Charts are a good way to visualize our data, but in some cases they are not enough.

Imagine you need to show a vehicle's current status based on its sensor data. Don't you think the image below is more suitable than a table or chart?

 

 

  

The intention is to expose, in the simplest way, the information needed to support future actions.

 

Despite all the possibilities for doing this with the available methods, the idea of using an SVG image that could be changed according to the data being received sounded really cool.

 

As you might know, SVG images are basically XML files with four interesting characteristics (at least the ones that matter for this component):

  • they are scalable,
  • they normally have smaller file sizes than similar image formats,
  • they can have identifiable elements in their structure,
  • they are modifiable via JavaScript once injected into the HTML DOM.

 

Merging the approach used in the SVG component with the Viz-API, Viz-SVG was born.

 

Requirements

  • Javascript knowledge
  • Pentaho 8.0
  • npm

If you don't already have Node.js (npm) installed, follow these instructions:

https://docs.npmjs.com/getting-started/installing-node

How to build your own visualization using Viz-SVG project

 

Download the example folder from the Git repository.

 

Extract the zip file; the necessary files are already there:

  • SVG image – circle.svg
  • The Model – yourmodel.js
  • Dummy data – data.json

 

circle.svg

 

<g>
  <title>SVG Example</title>
  <rect fill="#fff" id="canvas_background" height="402" width="582" y="-1" x="-1"/>
  <g display="none" overflow="visible" y="0" x="0" height="100%" width="100%" id="canvasGrid">
   <rect fill="url(#gridpattern)" stroke-width="0" y="0" x="0" height="100%" width="100%"/>
  </g>
 </g>
 <g>
  <title>Circle</title>
  <ellipse id="Sales" fill="lightgray" stroke="black" ry="70" rx="70"  cy="195" cx="281" stroke-width="5" />
 </g>
</svg>

 

data.json
  • svgElementID - the element id in circle.svg
  • value - business data

 

{
  "model": [
         {"name": "svgElementID", "type": "string", "label": "svg Element ID"},
         {"name": "value", "type": "number", "label": "Value"}
  ],
  "rows": [
         {"c": [ "Sales",  11000000]}
  ]
}
yourmodel.js

 

var SvgModel = BaseModel.extend({
     $type: {
          // SVG Label and Class
          styleClass: "pentaho-visual-samples-svg",
          label: "SVG Circle"
     },

     getSvgPath: function() {
          return "./circle.svg"; 
     },

     getSvgPartforDataID: function(dataId){
          return ["Stroke", "Fill"];
     },

     toSvgId: function(dataId, prop){
          return dataId ;
     },

     toSvgAttribute: function(dataId, prop, dataValue) {
          switch(prop){
               case "Stroke": return "stroke";
               case "Fill": return "fill";
          }
          return prop;
     },

     toSvgValue: function(dataId, prop, dataValue) {
          switch(prop){
               case "Stroke": return dataValue > 11000000 ? "red" : "green";
               case "Fill": return dataValue > 11000000? "green" : "red";
          }
     }
});






 

Pay attention to:

  • Label, this is the name that will appear in the list of available visualizations in Analyzer
label: "SVG Circle"

 

  • Path to your image
getSvgPath: function() {
     return "./circle.svg"; 
},

 

  • Properties that will be changed for each SVG element (Stroke and Fill)
getSvgPartforDataID: function(dataId){
     return ["Stroke", "Fill"];
},

 

  • SVG id - if you need to change the SVG element id add some logic to this function
    • Take a look at the other example (modelProductLine.js)
toSvgId: function(dataId, prop){
     return dataId ;
},

 

  • Mapping between the properties defined previously and SVG style properties
toSvgAttribute: function(dataId, prop, dataValue) {
     switch(prop){
          case "Stroke": return "stroke";
          case "Fill": return "fill";
     }
     return prop;
},

 

  • Define your business logic
toSvgValue: function(dataId, prop, dataValue) {
     switch(prop){
          case "Stroke": return dataValue > 11000000 ? "red" : "green";
          case "Fill": return dataValue > 11000000? "green" : "red";
     }
}

 

 

Now it's time to test our visualization

Register our files in index.html (the purpose of index.html is to render our code in the browser).

  • data.json is our dummy data
  • yourModel is our configuration

 

require([
        "pentaho/type/Context",
        "pentaho/data/Table",
        "json!./sales-by-product-family.json"
      ], function (Context, Table, dataSpec) {


        // Setup up a VizAPI context.
        Context.createAsync({ application: "viz-api-sandbox" })
          .then(function (context) {
            // Get the model and base view types
            return context.getDependencyAsync({
              BarModel: "pentaho/visual/samples/bar/modelProductLine",
              BaseView: "pentaho/visual/base/view"
            });
          })



 

Change:

line 04 - "json!./sales-by-product-family.json"   to    "json!./data.json"   and

 

line 13 - BarModel: "pentaho/visual/samples/bar/modelProductLine",  to   BarModel: "pentaho/visual/samples/bar/yourModel",

 

Let's open index.html in the browser.

Attention: directly opening the file through the filesystem will not work in Google Chrome (and possibly other browsers) because of security restrictions that disallow loading local resources using XHR, a capability Viz-SVG requires in order to load localization bundles and other resources.

To overcome this limitation we will use Node.js to serve the project files through an HTTP server.

 

Node:

npm install -g node-static

static -p 8001

 

Now you can open http://127.0.0.1:8001/index.html in the browser and test.

 

Change the values in data.json to modify the colors

Edit data.json and change 11000000 to 11000001.

 

Refresh your browser

Colors should change (if not, clear browser cache).

 

Edit your model until your visualization is OK.

 

Create the Pentaho Web Package

Register your model on config.js and package.json

  • Config.js
    •      add the following line: module.id.replace(/(\w+)$/, "yourModel")
select: {
          type: [
               module.id.replace(/(\w+)$/, "modelGeneric"),
               module.id.replace(/(\w+)$/, "modelProductLine"),
               module.id.replace(/(\w+)$/, "yourModel")
  ],

 

  • package.json
    • add the following line: "pentaho-visual-samples-svg/yourModel": {"base": "pentaho/visual/base/model"}
 "pentaho/typeInfo": {
      "pentaho-visual-samples-svg/modelProductLine": {"base": "pentaho/visual/base/model"},
      "pentaho-visual-samples-svg/yourModel": {"base": "pentaho/visual/base/model"}
    },

 

Let's create our .tgz file

 

Note: By using the npm pack command you ensure only your files and bundled dependencies are compressed in the resulting tgz file.

 

Run npm pack on viz-svg-sandbox folder.

 

Deploy on Pentaho server and PDI

 

Copy the pentaho-visual-samples-svg-0.0.1.tgz to

  • pentaho-solutions/system/karaf/deploy/ (Pentaho Server)
  • data-integration/system/karaf/deploy   (PDI)

 

See the visualization on Pentaho BA server and PDI

 

Create a new Analysis Report using SampleData

  • Choose
    • Measure - Actual
    • Category – Department

 

Pick our SVG Circle visualization from the list on the right side.

 

 

 

Do the same on PDI (DET)

  • Choose
    • Measure - Actual
    • Category – Department

 

 

 

 

Our .tgz includes one more visualization: SVG Product Line.

 

 

 

 

Marketplace

You can find more examples in the Pentaho Marketplace: search for VIZ-SVG and install both.

 

 

 

Why use Viz-SVG

 

Viz-SVG can be a good way to visualize your IoT data across the whole pipeline: PDI, Analyzer and CTools dashboards.

Example: you may need to check the occupancy of your office floor; what better than an SVG office layout?

 

 

Enjoy and share your visualizations.

SVG Component

Posted by Joao Barbosa Employee Dec 22, 2017

Introduction

Born from the need to simplify data consumption on dashboards, this component's goal is to provide more flexibility by leveraging SVG rendering capabilities.

 

The Spark and the Idea

The design of a particular dashboard to show a vehicle's current status, based on its sensor data, required the use of the image below.

vehicle_view.png

The intention was to expose, in the simplest way, the information needed to support future actions.

 

Despite all the possibilities for doing this with the available CTools components and methods, the idea of using an SVG image on the dashboard that could be changed according to the data being received sounded really cool.

 

As you might know, SVG images are basically XML files with four interesting characteristics (at least the ones that matter for this component): they are scalable, they normally have smaller file sizes than similar image formats, they can have identifiable elements in their structure, and they are modifiable via JavaScript once injected into the HTML DOM.

 

So, to show something similar to the designed data visualization for our dashboard, we only needed to produce the SVG image (roughly represented by the people icon below) with ids on the elements we wanted to change, gather the information that each sensor was sending, and modify the SVG layout (colors and/or text) using the required business rules.

people_basic.pngpeople_orange.png
people_basic_code.pngpeople_orange_code.png

 

How to Use

With this in mind and a couple more requirements, we ended up creating the SVG component's first version, which is available through the Pentaho Marketplace. The component is quite simple: you can add it to your dashboard layout just like any other component, and its use is controlled mostly by these 4 parameters:

 

  • URL - the relative path to the SVG image that will be used to show your data
  • Keep Image - whether the SVG should be reloaded (or not) when the user refreshes the dashboard
  • CallBackDraw - a courtesy function that will be executed after all the SVG changes are performed
  • Data source - The information that will be used to manage all the modifications necessary to show your data current status

 

Datasource Format and other requirements

In fact, everything is going to be managed by the information on your component's data source. Each record will represent a new SVG modification and is characterized by the following 4 attributes (the order is not significant but the names are):

 

  • svgElementID - the SVG element id where we want the component to perform changes
  • changeType - the kind of change the component should execute on the current element. This first version allows style (any style attribute common on CSS styling can be applied to SVG images) and text updates
  • styleProperty - the style property that will be added/changed. On text type operations this attribute is not mandatory
  • value - the value that will be used on the style property or the text update.

 

CDA_example.png

CDA example - data source record to perform the people icon's change illustrated earlier

 

Summing up, the process to use this component should be something like this:

  • Install it from the Pentaho Marketplace (a server restart will be needed). The component has only been tested on Pentaho 7.0 or higher.
  • Review the component's samples, available after installation under plugin-samples / sample-svg-component
  • On a requireJS dashboard, add your layout placeholder like for any other component, add the SVG component and start filling in the component's parameters
  • Create your data source and join it with the component specification
  • Test it on browsers that support HTML 5.0 and SVG rendering (and keep in mind that your audience should also have this type of browser)

 

Where and when to use it

Whenever you sense that data readability is improved by using a specific image, and you are about to show it on a CTools dashboard, consider the SVG component. Remember, you can always show sensor data in a table or a chart, but data coming from car sensors will always be easier to read if we can show it on a related image.

 

Related work

To spread the use of these visualizations across the Pentaho product family, a merge between Viz-API and SVG capabilities is also available. This new artifact, called Viz-SVG, allows you to use this type of visualization in Pentaho Analyzer, Pentaho Data Integration DET and also CTools.

JDBC SQL logging

Posted by Kleyson Rios Employee Dec 19, 2017

Introduction

 

A very typical use case for SQL logging is troubleshooting various database-related issues during development.

 

For this purpose, P6Spy is a useful library that logs all SQL statements and parameter values before they are sent to the database. P6Spy is free, it intercepts and logs all of your database SQL statements to a log file, and it works with any application that uses a JDBC driver.

 

If your application uses a DataSource, simply wrap your current DataSource object with P6DataSource. P6DataSource has a constructor that accepts the DataSource to wrap. This is by far the simplest method, especially if you use a dependency injection framework such as Spring or Guice.
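As a minimal sketch (the factory class name here is just for illustration), the wrapping looks like this:

import javax.sql.DataSource;
import com.p6spy.engine.spy.P6DataSource;

public class SpyDataSourceFactory {

    // Wrap an existing DataSource so that every connection it hands out is
    // proxied by P6Spy and its SQL statements are logged.
    public static DataSource wrap(DataSource realDataSource) {
        return new P6DataSource(realDataSource);
    }
}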

 

If your application obtains connections from DriverManager, simply modify your JDBC connection URL to include p6spy:. For example, if your URL is jdbc:mysql://host/db, just change it to jdbc:p6spy:mysql://host/db. P6Spy implements the JDBC 4.0 API, allowing automatic registration of its JDBC driver with DriverManager, so modifying the connection URL is all that is needed to activate P6Spy.

 

Using it

 

Pentaho can be configured to leverage the P6Spy logging capability as well.

 

To demonstrate its use, let's configure P6Spy to log all the SQL queries passing through the SampleData data source.

 

1. Download the P6Spy .zip file and unzip it.

 

2. Copy the p6spy-x.x.x-log4j-nodep.jar and p6spy-x.x.x.jar files into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder.

 

3. Edit the pentaho-server/tomcat/webapps/pentaho/WEB-INF/classes/log4j.xml file and append the following configuration at the end of the file, before the closing </log4j:configuration> tag:

 

   <!-- ========================================================= -->
   <!-- Special Log File specifically for p6spy                   -->
   <!-- ========================================================= -->

   <appender name="SPYLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="../logs/spy.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>


   <category name="p6spy">
      <priority value="DEBUG"/>
      <appender-ref ref="SPYLOG"/>
   </category>

 

NOTE: If the spy.log file is not created after the Pentaho server starts, update the File param with an absolute path.

 

4. Create the pentaho-server/tomcat/lib/spy.properties file and insert the following configuration as its content:

 

driverlist=org.hsqldb.jdbcDriver
appender=com.p6spy.engine.spy.appender.Slf4JLogger

 

5. Edit the pentaho-server/tomcat/webapps/pentaho/META-INF/context.xml file and add the following configuration for the new P6Spy data source:

 

<Resource name="jdbc/DSspy" auth="Container" type="javax.sql.DataSource"
   factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" maxActive="20" minIdle="0" maxIdle="5" 
   initialSize="0" maxWait="10000" 
   username="pentaho_admin" password="password"
   driverClassName="com.p6spy.engine.spy.P6SpyDriver" 
   url="jdbc:p6spy:hsqldb:hsql://localhost:9001/SampleData"
   validationQuery="select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES"/>

 

6. Restart the Pentaho server.

 

7. Log in to the PUC as Admin and click the Manage Data Sources button.

 

Screen Shot 2017-12-18 at 10.23.26.png

 

8. Edit the SampleData/JDBC data source.

 

Screen Shot 2017-12-18 at 10.24.24.png

 

9. Change the Access type from Native (JDBC) to JNDI, set the JNDI Name to DSspy, and test the connection.

 

Screen Shot 2017-12-18 at 10.28.06.png

 

Screen Shot 2017-12-18 at 10.28.53.png

 

10. Test it. Open the CTools Dashboard under the Public/Steel Wheels folder and/or create a new Analysis Report to use the SteelWheelsSales cube.

 

Screen Shot 2017-12-19 at 08.59.46.png

 

Screen Shot 2017-12-19 at 09.03.15.png

 

 

11. Verify that all the SQL queries are being logged in the spy.log file.

 

Screen Shot 2017-12-19 at 09.06.30.png

 

Where:

  • current time - the current time is obtained through System.getCurrentTimeMillis() and represents the number of milliseconds that have passed since January 1, 1970 00:00:00.000 GMT. (Refer to the J2SE documentation for further details on System.getCurrentTimeMillis().) To change the format, use the dateformat property described in Common Property File Settings.
  • execution time - the time it takes in milliseconds for a particular method to execute. (This is not the total cost for the SQL statement.) For example, a statement SELECT * FROM MYTABLE WHERE THISCOL = ? might be executed as a prepared statement, in which case the .execute() function will be measured. This is recorded as the statement category. Further, as you call .next() on the ResultSet, each .next() call is recorded in the result category.
  • category - You can manage your log by including and excluding categories, which is described in Common Property File Settings.
  • connection id - Indicates the connection on which the activity was logged. The connection id is a sequentially generated identifier.
  • statement SQL string - This is the SQL string passed to the statement object. If it is a prepared statement, it is the prepared statement that existed prior to the parameters being set. To see the complete statement, refer to effective SQL string.
  • effective SQL string - If you are not using a prepared statement, this contains no value. Otherwise, it fills in the values of the Prepared Statement so you can see the effective SQL statement that is passed to the database. Of course, the database still sees the prepared statement, but this string is a convenient way to see the actual values being sent to the database.

 

Caveats

 

  • For a production environment, check the impact on performance.
  • SQL queries will not be logged if the data was already loaded into the cache.
  • No tests were done for scheduled tasks.

 

Multi-tenant Environment

 

For a multi-tenant environment you might also need the logged-in user in the log file. For that, you can supply your own log message formatter to customize the format and include new data in it. Simply create a class that implements the com.p6spy.engine.spy.appender.MessageFormattingStrategy interface and place it on the classpath.
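As a rough sketch only (the class name is hypothetical, and the exact formatMessage() signature should be checked against the MessageFormattingStrategy interface shipped in your P6Spy version, since it has changed between releases), such a formatter could look like this:

import com.p6spy.engine.spy.appender.MessageFormattingStrategy;

// Illustrative formatter: verify the formatMessage() parameters against the
// interface in the P6Spy jar you deployed before using anything like this.
public class TenantAwareLineFormat implements MessageFormattingStrategy {

  @Override
  public String formatMessage(int connectionId, String now, long elapsed,
                              String category, String prepared, String sql, String url) {
    // In a Pentaho deployment, the logged-in user could be resolved here from
    // the platform session API and prepended to the line.
    String user = "unknown-user";
    return now + " | user=" + user + " | " + elapsed + "ms | " + category
        + " | connection " + connectionId + " | " + sql;
  }
}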

 

Customising the Log Message

 

1. Copy the attached P6SpyCustomMessage-1.0.0.jar file into the pentaho-server/tomcat/webapps/pentaho/WEB-INF/lib folder.

 

2. Append the following configuration to the pentaho-server/tomcat/lib/spy.properties file:

 

 

logMessageFormat=com.pentaho.krios.p6spy.message.formatting.PentahoCustomMessage

 

 

3. Restart the Pentaho Server.

 

4. Execute the reports/dashboards again and check out the new log message.

 

Screen Shot 2017-12-19 at 09.39.33.png

 

 

Enjoy !

This article was co-authored with Benjamin Webb

 

Foundational to any map, whether it be a globe, GPS or any online map, is the ability to understand data at specific locations. The ability to plot geospatial data is powerful as it allows one to distinguish, aggregate and display information in a very familiar manner. Using Pentaho, one can use shapefiles to plot areas on a map within a dashboard and then explore data geospatially. In this example, we use C*Tools and Pentaho Data Integration to examine the geographic spread of crime in the city of Chicago.

 

Getting and Preparing Shapefiles

 

There are many shapefile formats; the most popular is the format developed and regulated by ESRI. Many geographic analytics packages produce this format, so it is relatively easy to find shapefiles for common geographic boundaries, including country, state/province, county, postal code, political boundaries and more. For this analysis, we'll use data provided by the Chicago Data Portal.

 

First, to get the shapefiles, we will download the Community Area Boundaries datafile in ESRI format. To use it in Pentaho, we will prepare the shapefile by converting it from ESRI to GeoJSON using a command-line tool from GDAL called ogr2ogr. More information on this suite of tools can be found on their website and on GitHub. To execute this tool we can use the following PDI transformation, which calls ogr2ogr.exe with parameters including GeoJSON as the destination file type as well as the source and destination files.

From this process, important information is collected on the 76 community areas in Chicago. As seen below in a small sample of the GeoJSON file created, information is contained for the community of Douglas including a series of latitudes and longitudes representing the points that form a polygon.

 

{

"type": "FeatureCollection",

"name": "geo_export_80095076-0a6b-4028-a365-64ec9f0350d7",

"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [

{ "type": "Feature", "properties": { "perimeter": 0.0, "community": "DOUGLAS", "shape_len": 31027.0545098, "shape_area": 46004621.158100002, "area": 0.0, "comarea": 0.0, "area_numbe": "35", "area_num_1": "35", "comarea_id": 0.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -87.609140876178913, 41.84469250265397 ], [ -87.609148747578061, 41.844661598424025 ], [ -87.609161120412566, 41.844589611939533 ]…

 

Next, we can get crime data from the same site. This very large file contains diverse information on every crime occurring in Chicago since 2001 including the location, type of crime, date, community ID etc. To reduce the size of this file with over 6 million rows, we first run a quick PDI transformation that will group the crimes by year and community.

 

We need to join the crime dataset to the GeoJSON dataset we created, as only that dataset contains the names of the community areas. Both datasets share a common geographic code, allowing us to blend the data by community area ID.

 

For this C*Tools dashboard, we will use another PDI transformation as the source of the data. As will be seen later, when the user selects a year on the dashboard, this transformation will be triggered to get the total number of crimes for every Chicago community area in a given year.  We will also get the max number of crimes for all neighborhoods, to be used later in the color gradient.

 

With this transformation, we now have a set of data that includes key metrics along with their corresponding geographic areas represented as polygons.

 

Create the Dashboard with NewMapComponent

 

Now to create the actual map in Pentaho, we will use the C*Tools Community Dashboard Editor (CDE). This will be done using the NewMapComponent under the Components View.

 

This powerful component utilizes a mapping engine that renders the world for free (choosing between the OpenLayers or Google engines). Then, GeoJSON or KML is used to plot polygons on top of the map. In our case, the GeoJSON file outlining the communities throughout Chicago will be mapped.

 

Much of the functionality will be used through JavaScript snippets in the Component Lifecycle steps, such as Pre-Execution and Post-Execution.

 

As a brief example here, we can load our GeoJSON file with the following JavaScript in our Pre-Execution stage:

 

// locate our GeoJSON file under resources. Note, the BA server does not
// recognize the .geojson extension, so we rename to .js
var getResource = this.dashboard.getWebAppPath() + '/plugin/pentaho-cdf-dd/api/resources';
var mapDef = '${solution:resources/geojson/community-areas-current-geojson.js}';

// here we pass in our GeoJSON file, and also specify the GeoJSON property
// to use as a polygon ID
this.shapeResolver = 'geoJSON';
this.setAddInOptions('ShapeResolver', 'geoJSON', {
     url: getResource + mapDef,
     idPropertyName: 'area_num_1'
});

 

This will plot the polygons, with one caveat. The NewMapComponent also takes a data source as input (see the properties tab). This data source must contain data that matches the IDs specified above in the GeoJSON file, and only those polygons for which data points with matching IDs exist will be rendered.

 

We can specify which columns from our data source to use as the ID in a snippet also in the Pre-Execution phase like so:

this.visualRoles = {
        id: 0,                            // communityarea
        fill: 1                           // crimes
};

Note, here we defined the column for id as the first column (index 0), and use the 2nd column (index 1) as the fill value (more below).

 

To load our pre-aggregated data and render it on our map, we use the Kettle transformation described above, which takes a year parameter and then reads and filters the Aggregated-Crime-Counts.csv file.

 

This transformation ensures that the 1st column is the Community Area ID and the 2nd column is the # of crimes, to match our JavaScript above.

 

Finally, more JavaScript can be added for additional dashboard features. For our heat map example, we want to vary the fill color based on the number of crimes.

 

We've already linked the data with the code snippet above. The NewMapComponent has some defaults, but to ensure it works smoothly we can implement the fill function ourselves as follows, which is also implemented in the Pre Execution step:

// define the polygon's fill function manually based on the crimes/fill
// value incoming from the datasource
this.attributeMapping.fill = function(context, seriesRoot, mapping, row) {
        var value = row[mapping.fill];
        var maxValue = row[mapping.max];
        if (_.isNumber(value)) {
                 return this.mapColor(value,
                         0,                                        // min crimes
                         maxValue,                // max crimes from dataset in yr
                         this.getColorMap()     // a default color map of green->red
                 );
        }
};

 

The above function validates the incoming fill/crimes column, then takes the default color map (green to red) and maps all values on a scale from 0 to the max number of crimes in a year (in 2015, this number was 17,336, coming from the western community of Austin, seen below). All values in between will fall somewhere on the gradient.

 

 

Another very useful feature that can be implemented within the NewMapComponent is the tooltip, which displays information about a community area (the community name and the total number of crimes) when the mouse hovers over it. This is implemented in the Post-Execution step, again using JavaScript.

 

function tooltipOnHover() {
    /* TOOLTIP ON MOUSE HOVER 2 */
    var me = this;
    /*
        ** Define events for mouse move
        **/
        this.mapEngine.map.events.register('mousemove', this.mapEngine.map, function (e) {
                 if (!_.isEmpty(me.currentFeatureOver)) {
                         var modelItem = me.mapEngine.model.findWhere({
                                  id: me.currentFeatureOver.id
                         });
                         $('#popupObj')
                                  .css('top', e.pageY -50)
                                  .css('left', e.pageX + 5)
                                  // html contained in popup
                                  .html(
                                          modelItem.attributes.data.area_name + 
                    '<br>Total Crimes: ' + modelItem.attributes.data.fill + ''
                                  );
                 }
        });
        this.mapEngine.map.events.register('movestart', this.mapEngine.map, function (e) {
                 $('#popupObj').fadeIn(500);
        });
        this.mapEngine.map.events.register('moveend', this.mapEngine.map, function (e) {
                 $('#popupObj').fadeOut(500);
        });    
}

 

 

 

Effective analysis of geospatial data, especially with enhancement tools like the NewMapComponent, can be very powerful. From this basic example, we can better understand how crime is distributed across a very large city and how that spread has changed over time. Using the polygons allows us to better group the data in order to gain valuable insight.

 

This approach is heavily indebted to Kleyson Rios's NMC-samples repository, which has similar examples and can also be zipped & uploaded for exploring the NewMapComponent.

 

The code for this example can be found on GitHub.

This post originally published by Chris Deptula on Tuesday, October 27, 2015

 

I recently attended the Strata-HadoopWorld conference in NYC.  I have been attending this conference for the past few years and each year a theme emerges.  A few years ago it was SQL on Hadoop, last year was all Spark.  This year there was a lot of buzz about streaming and continuous data ingestion. Although often presented as different topics, the line between these concepts is blurring. 

 

To simplify, streaming means working with queues of events that are constantly flowing via a messaging system. Continuous ingestion is used more broadly and is the process of continuously ingesting from any data source, whether that be processing a file as soon as it is received, monitoring the database for new inserts, or streaming from a message queue.

 

One of the challenges with continuous ingestion is making data immediately available for analytics in a performant manner. For smaller amounts of data this can often be done using an RDBMS; however, for Big Data more complexity is required. Cloudera is trying to solve some of these challenges with Kudu; however, Kudu is still in beta and should not be used in production. Shortly before Strata-HadoopWorld, I started working on a demo for PentahoWorld using an architecture that solves this problem.

 

Before we go any further, let’s take a step back and explain the use case for our Pentaho World demo of real time analytics. I wanted to build a system that ingested a stream of tweets about the conference in real time while also making them available for analytics. The analytics are presented using a Pentaho CDE dashboard and Mondrian cube for slicing and dicing in Pentaho Analyzer. For the dashboard I performed a bit of Natural Language Processing on the tweets to determine their sentiment, and then made both aggregate metrics and individual tweets available as soon as the tweet was received. For the Pentaho Analyzer slice and dice capabilities, I allowed the data to be slightly older (not real time), but did update the underlying cube frequently.

 

How did I do this? At a high level I built a Pentaho Data Integration (PDI) transformation that acted as a daemon. This transformation used the Twitter Streaming API to get a continuous stream of tweets with the #PWorld15 hashtag, and does not stop processing new data until the transformation is manually stopped. I then used the Stanford Sentiment Analysis plugin for PDI to calculate the Wilson Sentiment of the tweet, before writing this data into MongoDB.

 

Once the data was in MongoDB, I built my Pentaho CDE dashboard. Pentaho CDE has the ability to source data from PDI transformations.  I built a few PDI transformations that used the MongoDB input step with Aggregate Pipelines to calculate the metrics I needed. Using Pentaho transformations as the input saved me from having to write a lot of JavaScript. Then it was simply a matter of telling the CDE components to refresh every 10 seconds and I had built my dashboard!

 

For the cube, Pentaho Analyzer requires a SQL interface to the data. It is true that this could be solved by using PDI's data service capability, which allows you to use SQL to query the results of a PDI transformation. However, the "PDI as a data service" option has a limitation in that the output data from the transformation must fit in memory -- not practical for Big Data. Instead, I extended the transformation that streams the data from Twitter into MongoDB to also write the data to Hadoop/HDFS. This enabled me to expose the data via Impala into the SQL interface Pentaho Analyzer requires.

 

Brilliant!  But, there was a catch. In Hadoop, a file is not readable until the file handle has been closed. If I never close the file because my transformation is constantly running, then the data would never be available to query. I could have written a file for every record, but this would have resulted in way too many small files and would have been detrimental to performance. Instead, I used the Transformation Executor step in PDI to group the tweets into 10 minute chunks, and then wrote one file every 10 minutes.

 

Once I had all of this figured out and working, I tested and encountered yet another problem. Tweets have new lines and lots of special characters. Storing the data in a plain text file was going to be a challenge. The good news is there are better file formats for Hadoop than plain text files, such as Avro and Parquet. So, instead of using text files, I used the Pentaho Consulting Labs Parquet Output step. This had the dual benefit of easily handling new lines and special characters while also improving the underlying performance of Cloudera Impala queries.

 

Finally, I created a few Impala tables and a Mondrian cube, and solved my problem. I was able to stream tweets in real time through Pentaho to make them immediately available in a dashboard, and, with a modest delay, also have them available for slice and dice analytics.

 

 

 

This post originally published by Kevin Haas on Tuesday, July 14, 2015

 

When working with our clients, we find a growing number who regard their customer or transaction data as not just an internally leveraged asset, but one that can enrich their relationships with customers and supporting partners. They need to systematically share data and analytics outside their firewall.

 

One of the ways we help them achieve this goal is with Custom Data APIs. In this article, we'll show you how to build Data APIs using visual programming within the Pentaho Data Integration interface. Along the way, we'll expose capabilities of the platform that are not always seen by the average user.

 

Inspecting The Field: Inside Pentaho Data Integration

For those who use Pentaho's Data Integration (PDI aka Kettle) platform regularly, you're probably aware of the amazing things you can do. For the uninitiated, PDI is a very powerful Extract-Transform-Load (ETL) technology that lets you access data from virtually any source, perform complex transformations, and push/load that data to virtually any destination. It's commonly used for integrating and shaping data into new forms optimized for analytical purposes (data warehouses, data marts, etc.).

 

However, Pentaho experts know that Kettle is more than an ETL platform. It is powered by an amazing, open, and flexible data management platform that can not only read and write data from relational databases, Hadoop, NoSQL databases, and Web API's, but it can also serve data to other applications.

 

Enter one of the most powerful (but maybe underutilized) components of the Pentaho Data Integration Platform: the Data Integration Server. It can power your API's.

 

Sowing The Seeds: The Data Integration Server

If you're an Enterprise Edition Pentaho customer, you're probably aware of the capabilities of the Data Integration Server. It's one of the foundational reasons why customers choose the Enterprise Edition of PDI. The Data Integration Server is a server application that allows you to schedule and run jobs and transformations created with PDI's "Spoon" visual development tool.

 

Community Edition users have a similar (but not as sophisticated) version of the Data Integration Server with the included Carte server. If you're a community edition user, it's easy enough to start up your own local Carte server on a port of your choosing. You could start Carte on port 8081 like this:

 

carte.sh 127.0.0.1 8081

or

carte.bat 127.0.0.1 8081

 

There are many configuration options for the Data Integration Server and Carte server that allow it to operate in all kinds of forms. See here for the current list of options.

 

Bringing It To Life: Calling The Transformation on the Server with ExecuteTrans

The first step to create a Data API is to develop a Pentaho Data Integration transformation that gathers data from virtually any source: RDBMS, Hadoop, NoSQL databases, Web APIs, etc. Since transformations can be parameterized, you can apply filtering, sorting or any other customizable logic to your code. Again, this can all be developed using PDI's graphical development tool: Spoon.

 

Once you have installed and launched your Data Integration or Carte server, you can execute your transformations with a simple HTTP call using the ExecuteTrans method. For example, if you have Carte running on port 8081 and want to execute a transformation that is stored in /srv/pentaho/sample.ktr and accepts a parameter named "parm", you can simply call:

 

http://127.0.0.1:8081/kettle/executeTrans/?trans=/srv/pentaho/sample.ktr&parm=parmvalue

 

When you launch this URL in a browser, you won't see a lot of information in return. In fact, you'll get nothing back! But if you look at the PDI logs on the server, you'll see that the sample.ktr transformation did run. This is all well and good -- you were able to launch a transformation on a server via an HTTP request. But now you want it to return some data!

 

Growing a Transformation into an API: The Output To Servlet Option

To provide an output, the transformation must produce one or more rows of data through a configured output step. To send rows of data out as a response from the ExecuteTrans method, your transformation must pass this data to a Text File Output (or JSON Output) Step, configuring the step to "Pass output to servlet."

 

 

By checking the "Pass output to servlet" box, the transform will take the stream of data entering this step and push it out as your response. Note that the exact data contents of the stream are output as specified by the steps formatting options (e.g. fields, headers, separators). You'll want to configure depending on your needs.

 

Bringing it to Market: A Taste of Farmers Market Data In A Custom Data API

Let's look at an example. With harvest season upon us, more and more farmers markets are filling up with their bounty. But where can you find local farmers markets? Thankfully, the USDA provides an API that lets you search for nearby markets.

 

http://catalog.data.gov/dataset/farmers-markets-search 

 

This is a powerful set of data, but it's not always easy to use the API (or any API) as provided. In this case, to get a market, you first provide a zip code. The API returns a list of markets with their associated IDs, which you then must use to query the API again in order to get detailed information on each market. Also, we'd like to handle the result data in JSON format, but some elements of the API return comma-separated values. Finally, we need to format the returned address a bit better, attaching latitude and longitude for mapping.

 

To begin, we'll create a transform that accepts a parameter called "zip", and does the following:

 

1. Generate Rows: Initialize with a row for Pentaho to process input parameters.

2. HTTP Client: Call the USDA API to get the list of markets with the input parameters.

3. JSON Output: Convert the JSON that comes back into a list of markets.

4. JSON Output: Extract the Market IDs from that list of markets.

5. Calculator: Format a URL string to re-query the API to get detailed market info for each market.

6. HTTP Client: Call the USDA API again for each of the markets to get the details.

7. JSON Output: Extract the detailed address info on the market.

8. HTTP Client: Call the Google Geocoding API to get the latitude and longitude.

9. JSON Output: Extract the latitude and longitude from Google's response.

10. Javascript: Clean up and format the Address and Market info from the source APIs into a nice JSON format.

11. Group By: Group all of the detailed Markets from the API into a single row.

12. Javascript: Format the final JSON.

13. Text File Output: Output the single data element as a response to the servlet to show in the server.

 

 

A lot of steps, for sure, but their configuration is relatively simple. This is what the transformation looks like in the Spoon IDE:

 

Once we have developed and tested our transformation using Spoon, we are ready to use our Carte server and the ExecuteTrans method. We open a browser and enter the following URL:

 

http://127.0.0.1:8081/kettle/executeTrans/?trans=/mnt/pentaho/pdi-ee/repository/api/farmers-market.ktr&zip=60634 

 

Note, we can even front end the Carte or DI server with Apache and do a little URL rewriting to make it a friendlier URL to call. We did this. So instead of the above executeTrans URL, we can call this:

 

http://127.0.0.1:8081/api/farmers-market?zip=60634

 

When you execute the API, it does all the work for you. It calls all the various child APIs, does all the cleanup and formatting, and returns delicious looking JSON, with all of the added items from the Transformation as a result. For example:

 

{ 
    market:[ 
       { 
          ami_id:"1010508",
          marketname:"Eli's Cheesecake and Wright College Farmer's Market",
          location:"6701 W. Forest Preserve Drive, Chicago, Illinois, 60634",
          address1:"6701 West Forest Preserve Drive",
          address2:"",
          city:"Chicago",
          state:"IL",
          postal_code:"60634",
          latitude:"41.9588993",
          longitude:"-87.7944428",
          schedule:[ 
             "06/12/2015 to 09/25/2015 Thu: 7:00 AM-1:00 PM"
          ],
          products:[ 
             "Baked goods",
             "Crafts and/or woodworking items",
             "Cut flowers",
             "Fresh fruit and vegetables",
             "Honey",
             "Canned or preserved fruits, vegetables, jams, jellies, preserves, dried fruit, etc.",
             "Maple syrup and/or maple products",
             "Poultry",
             "Soap and/or body care products"
          ]
       }
    ]
 }

 

Taking It Home: Building Your Own Data APIs

This is a fun illustration of how to pull data from various sources and craft it together into your very own API. All of this was done with out-of-the-box Pentaho Data Integration capabilities, no extra parts required.

 

Clearly, you're not limited on what you can do with this capability. You can easily get data from virtually any source and wrap a custom data API around it, giving you new capabilities to leverage your existing infrastructure, and share data in new ways. Pentaho has a lot of capabilities inside. We know that once you take PDI home from the market and look inside the box, you'll be inspired.

 

If you're interested in seeing how the transformation works, including a look at the transformation that powers the example, just let us know.

This post originally published by Bryan Senseman on Wednesday, October 15, 2014

 

I'm a huge user of Mondrian, the high speed open source OLAP engine behind Pentaho Analyzer, Saiku, and more. While the core Mondrian engine is wonderful, there are times when it doesn't do what I need it to, or exactly what I expect it to. Take this special case that I've encountered at some of my OEM customers, a case I call "Abnormal Genealogy."

 

Mondrian Needs Certain Data To Show Hierarchies

Mondrian has multiple ways to define hierarchies: traditional, ragged and parent-child. This is the feature that allows you to roll metrics up and down for complex parent and child record combinations. Think organization hierarchies, product hierarchies and employee hierarchies. In many cases, you have the IDs and data to support this cleanly, the way Mondrian expects, where parents have IDs that are less than (e.g. 3 < 14) their associated child IDs.

 

But having data the way Mondrian expects it isn't always the case, particularly with OEM customers. There can be cases where the IDs of the children are less than the IDs of the parent. For example, my latest client preloads their system with standard/common "items" and then allows the end customer to create new parents (groupings) of these items. As a result, the preloaded items end up with IDs that are less than those of the newly added, customer-specific parent items. This reasonable load pattern breaks the Mondrian assumption that parent IDs are always less than child IDs.

 

Parent-Child ID's Need to Follow Certain Rules

When I first saw this, I thought "This is an edge case." But then I saw it again. And again. Clearly, I had to come up with a repeatable way to solve this problem. Fortunately, the ubiquitous FoodMart database can be used to replicate this issue. It contains an employee hierarchy with parent-child relationships that are pre-populated to fit the Mondrian assumption: all parent IDs are less than child IDs.  

 

Out of the box, FoodMart works correctly, but what happens if we simulate the replacement of a supervisor? We need to insert a new supervisor record in the employee table. This results in the new supervisor record receiving a higher employee_id than the employees which will report to it. The simple assignment of IDs for new records causes this peculiarly common case: abnormal genealogy!

 

To demonstrate this FoodMart scenario, let's look at a Pentaho Analyzer report showing Senior Management and some reporting data before the supervisor replacement:

 

 

Sheri Nowmer, inclusive of herself, has 10 Senior Managers assigned to her,  which includes Roberta Damstra.   All is well and good since the IDs of the parent records (supervisors) are less than the IDs of the child records (managed employees). Now, let's say Roberta retires to sunny Orlando and a new VP is hired, Mr. New Guy, on 1/1/1998. A few simple SQL edits and a Mondrian cache clearing and we should be good to go. We'll assign Mr. New Guy's employee_id with the value 1206. 1206 is much larger than any other employee_id in our database so it will easily stick out when we are reviewing the data. It also follows the usual pattern of a new employee record being assigned an ID which is greater than the maximum ID used.

 

Running the same report again after clearing the Analyzer and Mondrian caches, provides us with this view:

 

 

As expected, Roberta is no longer listed. However, Mr. New Guy also does not appear in this view. We confirm that we inserted his record so that he reports to Sheri Nowmer and that all of Roberta's employees now report to him. So, what happened to Mr. New Guy? His "abnormal genealogy" is causing internal issues for Mondrian. That is, the fact that his ID is greater than the IDs of his reports is causing Mondrian to incorrectly traverse the reporting hierarchy.

 

Solving the Abnormal Genealogy Problem

So, how can we solve this problem? We could dig into the Mondrian source code. It's open source, after all. But I'd prefer to leave revising the core of the Mondrian engine to the Java wizards at Pentaho and in the community. Perhaps we can pull off a bit of Pentaho Data Integration (PDI) magic with some minor data structure changes to overcome this gap faster and more easily.

 

The solution here is to use PDI to generate an extra set of ordered parent-child IDs. Using PDI, we need to identify the correct hierarchical order and make this order available to Mondrian. The transform shown below demonstrates the steps I used to re-sequence the IDs into the correct genealogical order. Using this approach, parents consistently have smaller IDs than all of their children, making Mondrian happy without the need to modify its core code. Since the solution is data-centric, we're also better protected for future versions of Mondrian.

 

 

So where exactly is the "magic" needed to resequence the IDs? It's in the use of the Closure Generator step, utilizing the existing parent and child IDs (which works regardless of genealogy). In this case, we take the base data and then reorder the ultimate parent rows by distance, as provided from the Closure Generator. Notice (below) that the Mr. New Guy, employee_id number 1206, is now the ninth "oldest" employee, perfect for a parent!

 

 

The next part of the solution is to update the dimension table with both of the ordered IDs (the ordered employee ID and the ordered supervisor ID), as they will be needed in the Mondrian hierarchy configuration. If you're maintaining this with a PDI Dimension Lookup/Update step, I suggest using the Punch Thru option for these two fields.

 

The final bit of data engineering requires you to create a tweaked version of the closure table that uses the ordered parent and original child IDs. This allows the correct groupings for parent aggregation while linking back into the original fact table with the actual IDs.
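A hedged SQL sketch of that step follows (the employee2 and employee_closure_cube table names come from the schema changes described next; the ordered_employee_id column and the join are my assumptions about how the ordered IDs are stored on the dimension):

-- Assumed column name: employee2.ordered_employee_id holds the re-sequenced ID.
-- Rebuild the closure table so the supervisor side carries the new ordered ID
-- while the employee side keeps the original ID used by the fact table.
CREATE TABLE employee_closure_cube AS
SELECT d.ordered_employee_id AS supervisor_ordered_id,
       c.employee_id,
       c.distance
FROM employee_closure c
JOIN employee2 d ON d.employee_id = c.supervisor_id;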

 

 

Now that our data engineering tasks are complete, all we need is a small adjustment of the hierarchy definition in the Mondrian schema. We simply modify the level definition to utilize the new ordered fields. These changes result in the following dimension definition. (Note: we create new tables called employee2 and employee_closure_cube in order to preserve the original FoodMart structures in case we want to revert to the out-of-the-box configuration later.)

 

 

Back to Analyzer we go, refreshing both caches and... sure enough, Mr. New Guy is official!

 

 

When I first encountered this problem, even Google wasn't able to find an effective solution. I only found a few discussions of the problem in semi-related JIRAs: http://jira.pentaho.com/browse/MONDRIAN-1328 and http://jira.pentaho.com/browse/MONDRIAN-1511. Given my experience with hierarchical data, I expect this is a fairly common case. I hope you find this approach helpful.

This post was originally published by Chris Deptula on Wednesday, November 19, 2014.

 

Many of you requested more information on the inner workings of the Sqoop component. Perhaps the best way to explain is via "lessons learned". Here goes...

 

Use the split-by Option

Sqoop is primarily used to extract and import data from databases into HDFS and vice versa. This was exactly what HDI (our Hadoop Datamart Importer) needed to do. So, I pulled up the Sqoop user manual for a quick refresher and quickly found the import-all-tables option. It seemed like the perfect solution: one command and I could extract every table from the datamart to HDFS. Great! So I tried it...

 

Almost immediately, Sqoop failed. It turns out that Sqoop, by default, extracts data from tables in chunks, and the chunking is done using each table's primary key. If a table has a single-column primary key, the extract works. If no primary key exists, it fails. In our experience, most dimension tables follow this design pattern -- using a single-column surrogate key. However, many, if not most, fact tables do not have a primary key, much less one stored in a single column. This was the case in our test database. When Sqoop tried to extract a fact table, the job failed. Time to start over.

 

It turns out Sqoop has a split-by option specifically designed to override the primary key default. Unfortunately, the split-by option cannot be used with import-all-tables, since the split-by column will vary by table. This meant that HDI would have to generate table-specific import statements, each with its own split-by column. Simple enough: I could use Pentaho Data Integration to interrogate the database metadata and generate each table's specific Sqoop import command. Alas, once I began designing, I soon discovered complexity.

 

When using the split-by option, you should choose a column whose values are uniformly distributed. If there is skew in the column data, then the chunks being processed will also be skewed, causing the overweighted import tasks to potentially fail due to lack of compute resources (usually memory). Thus, picking the split-by column required knowledge of table content that an automated process may not have. We'd have to make educated guesses.

 

As with the import-all-tables option, the easiest rule to apply is to use a single-column primary key if one exists. We'd have to assume that the column is a surrogate key populated with a sequence of values. If no primary key column exists, we look for a date column. Our educated guess is that dates represent events, and events tend to be more uniformly distributed than not. If no date column exists, we look for an integer column. If no integer column exists, we simply grab the first column we can find. Clearly this system is "version 1", with much potential for optimization. We have some ideas and plan on iterating the design in the coming months.
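The selection logic lives inside the HDI transformation itself, but its spirit can be sketched as a metadata query (MySQL's information_schema is shown; 'my_datamart' is a placeholder schema name, and the buckets are simplified, since column_key = 'PRI' does not distinguish single-column from composite primary keys):

-- 'my_datamart' is a placeholder schema name; adjust for your source database.
-- Rank candidate split-by columns per table:
-- primary key first, then date columns, then integers, then anything else.
SELECT table_name,
       column_name,
       CASE
         WHEN column_key = 'PRI'                                    THEN 1
         WHEN data_type IN ('date', 'datetime', 'timestamp')        THEN 2
         WHEN data_type IN ('tinyint', 'smallint', 'int', 'bigint') THEN 3
         ELSE 4
       END AS preference
FROM information_schema.columns
WHERE table_schema = 'my_datamart'
ORDER BY table_name, preference, ordinal_position;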

 

Given the acknowledged fuzziness of our split-by algorithm, we knew we'd have to make HDI configurable so that it could generate the commands without executing them. You can use HDI to interrogate your source database metadata, generate the Sqoop import commands and then, optionally, execute those commands. This gave us more control to manually customize and execute the generated import commands.

 

Next, I tried to run my generated Sqoop import commands.

 

Manage Database Fetch Size

The imports were working. I had loaded several dimensions and the first fact table. However, while importing my second, larger fact table, the MapReduce job failed with a GC overhead limit exceeded error. After a lot of Google searching and head scratching, I determined that the MySQL JDBC driver was reading all of the rows in a single fetch and trying to hold everything in memory. I had to tell it to return the data in batches. The following parameters on my JDBC connection string did the trick.

 

 

?dontTrackOpenResources=true&defaultFetchSize=1000&useCursorFetch=true

 

My issue and solution were specific to MySQL, but I suspect that the same problem could occur with other RDBMSs and JDBC drivers. You've been warned.

 

Convert Dates to Strings

With the fetch problem solved, my Sqoop imports finally completed. My next task was to define Hive tables on the newly loaded data and attempt some queries. All was fine until I selected some date fields. They were returned as integers instead of dates... ugh! Being an experienced programmer, I quickly realized that the integers were "seconds since epoch" -- which most will agree is not a user-friendly way to present dates! I dug through the Sqoop code and recognized that my date values were being treated as timestamps. To prevent Sqoop from converting dates to seconds since epoch, I had to manually map the datatype for each date field. I used Sqoop's map-column-java option to map date columns to the Java String class. The generated Sqoop command had to include an option like this:

 

--map-column-java my_date=String

 

Once I made the change, the dates were output and queryable in yyyy-MM-dd HH:mm:ss.S format instead of seconds since epoch.

 

Success!

It took quite a bit of trial and error, but I was finally able to develop a Hadoop Datamart Importer. The core Sqoop import command generation module works like a champ! If you would like to learn more about how Pentaho Consulting's Data Engineering services can help you with your Hadoop data ingestion needs, give us a call.

This post was originally published by Chris Deptula on Tuesday, February 24, 2015.

 

This is the third in a three-part blog series on working with small files in Hadoop.

 

In my previous blogs, we defined what constitutes a small file and why Hadoop prefers fewer, larger files. We then elaborated on the specific issues that small files cause, namely inordinate NameNode memory usage and MapReduce performance degradation, and began exploring common solutions to these problems. In this blog, I'll close the series by examining some less commonly used alternatives for solving the MapReduce performance problem and the factors to consider when choosing a solution.

 

Solving the MapReduce Performance Problem

In my previous blog, I listed the following solutions for mitigating the MapReduce performance problem:

• Change the ingestion process/interval

• Batch file consolidation

• Sequence files

• HBase

• S3DistCp (If using Amazon EMR)

• Using a CombineFileInputFormat

• Hive configuration settings

• Using Hadoop’s append capabilities

 

My second blog already discussed changing the ingestion process, batch file consolidation, and sequence files. I will cover the remaining options here.

 

HBase

If you are producing numerous small files, storing your data as files in HDFS may not be the best solution. Instead, you might consider using an HBase column store. Using HBase changes your ingestion process from producing many small HDFS files to writing individual records into HBase tables. If your data access pattern is based on well-defined, random-access lookups, HBase may be your best alternative. It is architecturally tuned for high-velocity record inserts, high-volume individual record lookups, and streaming-based analytics. However, if your data access pattern tends toward full file or table scans, then HBase may not be optimal.

 

It is possible to create a Hive table that maps to HBase data; however, query performance in this design will vary. Hive on HBase will shine when a single row or a range of rows is selected, but if your queries tend toward full table scans, HBase is very inefficient. Most analytical queries, especially those using GROUP BY, require full table scans.
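To make the Hive-on-HBase option concrete, a Hive table can be layered over an existing HBase table roughly as follows (table, column family, and column names here are hypothetical):

-- Table name, column family 'd', and columns are hypothetical examples.
-- Map a Hive table onto an HBase table via the HBase storage handler.
CREATE EXTERNAL TABLE transactions_hbase (
  rowkey   STRING,
  txn_time STRING,
  amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:txn_time,d:amount")
TBLPROPERTIES ("hbase.table.name" = "transactions");

Point lookups and rowkey range scans through such a table are fast, while GROUP BY-style analytics still degenerate into full scans of the underlying HBase table.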

 

HBase provides the best ability to stream data into Hadoop and make it available for processing in real time. However, balancing the needs of HBase with other cluster processes can be challenging and requires advanced system administration. Additionally, HBase performance is largely dependent on data access patterns and these should be carefully considered before choosing HBase to solve the small file problem.

 

S3DistCp

This solution is only available for users of Amazon EMR. Amazon EMR clusters are designed to be short lived, persisting their data in Amazon S3. Even with Amazon S3, processing a large number of small files still results in launching more map tasks than necessary -- decreasing performance. Enter S3DistCp...

 

S3DistCp is a utility provided by Amazon for distributed copying of data from S3 to ephemeral HDFS or even other S3 buckets. The utility provides the capability to concatenate files together through the use of groupBy and targetSize options. This is useful when you have thousands of small files stored in S3 that you want to process using Amazon EMR. S3DistCp kills two birds with one stone by concatenating many small files and making them appear in faster, ephemeral HDFS storage. There have been reports of as much as 15x performance improvement using this mechanism.

 

For all practical purposes S3DistCp does the same task as the batch file consolidation approach I mentioned in my previous blog. If using Amazon EMR, note that you have a pre-built tool that accomplishes this task.

 

Using a CombineFileInputFormat

The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them “on the fly” for consumption by a single map task. You gain the benefits of not launching one map task per file and not requiring that multiple files be merged into a single persisted file as part of a preparatory step. This solves the problem of a MapReduce job launching too many map tasks; however, since the job is still reading in multiple small files, random disk IO remains a problem. Additionally, most implementations of the CombineFileInputFormat do not account for data locality and will often pull data from a variety of data nodes over the network.

 

In order to implement this, the CombineFileInputFormat must be extended in Java for each different file type, which requires significant development expertise to build your own custom input format classes. However, once written, you can configure a maximum split size and it will merge files until this size is met.

 

Note that since the merged data is not persisted in HDFS, the CombineFileInputFormat does not alleviate the NameNode memory problem. A good example of an implementation of a CombineFileInputFormat may be found here.

 

Hive Configuration Settings

If you notice that Hive is creating small files in your Hadoop cluster through “create table as” or “insert overwrite” statements, you can adjust a few Hive-specific configuration settings to mitigate the problem. When used, these settings tell Hive to merge any small files that were created into larger files. However, there is a penalty: Hive will launch an additional MapReduce job, post-query, to perform the merge. Further, the merge is done before Hive indicates to the user that the query has finished processing, rather than occurring asynchronously.

 

It should be noted that these settings only work for files that are created by Hive. If, for example, you create files outside of Hive using another tool such as Sqoop, and copy them into the Hive table with an hdfs dfs -mv command, Hive will not merge the files. Therefore, this solution does not help when the files ingested into Hadoop are already small. It is only recommended in Hive-centric architectures where a small performance penalty on insert overwrite and create table as statements is acceptable.

 

The settings to be used are:
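As a sketch, these are the standard Hive merge properties (the values shown are illustrative and should be tuned for your cluster and Hive version):

-- Illustrative values; tune for your workload and Hive version.
SET hive.merge.mapfiles = true;               -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles = true;            -- merge small files produced by map-reduce jobs
SET hive.merge.size.per.task = 256000000;     -- target size, in bytes, of the merged files
SET hive.merge.smallfiles.avgsize = 16000000; -- merge only when the average output file size falls below this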

 

 

Using Hadoop’s Append Capabilities

I would be remiss if I did not mention this final option. Invariably, when I discuss how to handle small files, I hear the question “I was reading a blog and saw that Hadoop has the ability to append to files. Why can’t we just append to existing files?”

 

The story of appends in Hadoop is rather rocky. Append was added in July of 2008 as part of Hadoop 0.19. However, after implementation (as early as October 2008) many issues were found, and append was disabled in 0.19.1. In order to support HBase without risk of data loss, append capabilities were then added back to Hadoop in 0.20.2. So, finally, after 0.20.2 it was technically possible to perform appends in Hadoop.

 

Append may be available, but none of the major tools in the Hadoop ecosystem (Flume, Sqoop, Pig, Hive, Spark, and Java MapReduce) support it. MapReduce enforces a rule that the output location of a MapReduce job must not exist prior to execution. Because of this rule, it is obviously not possible for MapReduce to append its output to pre-existing files. Since Sqoop, Pig, and Hive all use MapReduce under the covers, it is also not possible for these tools to support append. Flume does not support append largely because it operates under the assumption that, after some threshold (seconds elapsed, bytes written, number of events, or seconds of inactivity), it will close the file and never open it again. The Flume community has deemed this sufficient and has not demanded append support.

 

If you truly must use appends in Hadoop, you have to write your own system to perform the ingestion and append to existing files. Additionally, if any of your in-cluster processing requires appending to existing files, you will not be able to use Spark or MapReduce. Therefore using HDFS append capabilities is very complex and should only be used by the most technologically savvy organizations. Without a significant engineering team and support commitment, this option is not recommended.

 

Choosing a Solution

Choosing the best solution for working with small files depends on a variety of questions. It may be necessary to use a combination of these solutions based on access patterns and data requirements. The questions that should be considered include:

• At what point in the data flow are the small files being generated? Are the small files being created at ingestion, or via in-cluster processing?

• What tool is generating the small files? Can changing tool configuration reduce the number of small files?

• What level of technical skill exists within your organization? Do you have the capability to maintain custom input formats or to write your own ingestion engine?

• How often are the small files being generated? How often can small files be merged in order to create large files?

• What sort of data access is required for these small files? Do the files need to be accessible through Hive?

• What administrative windows might exist during which processes can be run inside the cluster to consolidate small files?

• What level of latency is acceptable from MapReduce processes?

 

Once you answer these questions, then you can evaluate and choose the best options.