

2 Posts authored by: Jesse Zuckerman

One of the most heavily discussed topics in machine learning and data mining today is sentiment analysis. For the uninitiated, sentiment analysis is the task of classifying text as positive or negative based only on previously classified text. In this article, I will attempt to classify the sentiment of Twitter comments about a certain movie, based only on a dataset of 10,662 movie reviews released in 2005. The solution will be demonstrated in two ways: first using only Pentaho Data Integration (with some R), and then as a more sophisticated model built using Weka.

Understanding the Naïve Bayes Classifier

Although many machine learning algorithms become complex and difficult to understand very quickly, the Naïve Bayes classifier relies on one of the most fundamental rules in statistics, allowing its results to be highly interpretable while also maintaining a high degree of predictive power. It is based upon Bayes’ Rule, which can be used to predict conditional probability. The equation reads:

P(A|B) = P(B|A) × P(A) / P(B)

Applying Bayes’ Rule to sentiment analysis, to classify a movie as bad given a specific review of “I hated it”, would be:

P(negative | “I hated it”) = P(“I hated it” | negative) × P(negative) / P(“I hated it”)

The classifier is called “naïve” because we will assume that each word in the review is independent. This is probably an incorrect assumption, but it allows the equation to be simplified and solved, while the results tend to hold their predictive power.
Applying Bayes’ Rule has allowed us to dramatically simplify our solution. To solve the above equation, the probability of each event must be calculated.
  • P("I"|negative) can be described as the total number of times “I” appears in negative reviews, divided by the total number of words in negative reviews
  • P(negative) is the total number of words that are in negative reviews divided by the total number of words in the training data
  • P("I") is the total number of times “I” occurs in all reviews divided by the total number of words in the training data


We can then apply the same equation, replacing each occurrence of negative with positive. Whichever probability is higher determines whether we predict the review’s sentiment as negative or positive. The expectation is that “hated” occurs significantly more often in the negative reviews, with the other terms being similar in both classes, allowing us to correctly classify this review as negative.
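As a rough illustration, the comparison can be sketched in Python with invented toy counts (not the actual review dataset):

```python
from math import prod

# invented word counts from a pretend training set (illustrative only)
neg_counts = {"i": 40, "hated": 9, "it": 35}
pos_counts = {"i": 42, "hated": 1, "it": 38}
neg_total, pos_total = 1000, 1000          # total words per class
review = ["i", "hated", "it"]

def score(counts, class_total, grand_total, review):
    # P(class) times the product of P(word|class); the P(word) denominator
    # is the same for both classes, so it cancels when comparing them
    p_class = class_total / grand_total
    return p_class * prod(counts[w] / class_total for w in review)

neg = score(neg_counts, neg_total, neg_total + pos_total, review)
pos = score(pos_counts, pos_total, neg_total + pos_total, review)
prediction = "negative" if neg > pos else "positive"
```

With these toy counts, “hated” dominates and the review scores as negative.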


Build a Naïve Bayes Model using Pentaho Data Integration

To build the model in Pentaho, there are a few steps involved. First, we prepare the data by cleaning it. Next, we build the terms for each word in the classifier. Lastly, we test the performance of the model using cross-validation.

Step 1: Cleaning and Exploring the Source Data

To perform the sentiment analysis, we’ll begin the process with two input files: one for negative reviews and one for positive reviews. Here is a sample of the negative reviews:


To prepare the data for aggregation, punctuation is removed and words are lowercased, allowing a table aggregated by class and word. Using the data explorer, we can start to see the word-count differences for some descriptive words. These numbers intuitively make sense and help to build a strong classifier.
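The cleaning and aggregation step can be sketched in Python (the two sample reviews here are invented):

```python
import re
from collections import Counter

# strip punctuation, lowercase, then count words per (class, word) pair
reviews = [("negative", "I hated it."), ("positive", "Loved it!")]

counts = Counter()
for label, text in reviews:
    words = re.sub(r"[^\w\s]", "", text).lower().split()
    for w in words:
        counts[(label, w)] += 1
```

The resulting table of (class, word) counts is what the classifier terms are built from.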



Step 2: Building Terms for the Classifier

Next, we build the various terms for the classifier. Using the calculator steps, we need the probabilities and conditional probabilities for each word that occurs either in a negative review or positive review (or both) in the training data. The output from these steps then creates the parameters for the model. These need to be saved, so eventually they can be used against testing data. Here is a sample:


It can be noted that some of these word counts are null (zero). In the training data, this only occurs when a word’s count is zero for one of the two classes; in the test data, it can occur for both classes of a given word. You will notice that the conditional probabilities for these null words are nonzero. This is because Add-1 smoothing is implemented: we “pretend” that the count is 1 when we calculate the classifier, preventing the calculation from zeroing out.
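A minimal Python sketch of the smoothing described above, with invented counts (note the textbook “add-1” variant instead adds 1 to every count and the vocabulary size to the denominator; the version here follows the simpler treat-zero-as-one rule from the text):

```python
# "delightful" never appears in the negative reviews of this toy set
neg_counts = {"hated": 9, "delightful": 0}
neg_total = 1000

def p_word_given_neg(word):
    count = neg_counts.get(word, 0)
    return max(count, 1) / neg_total   # pretend the count is 1 when it is zero

p_unseen = p_word_given_neg("delightful")  # nonzero thanks to smoothing
```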


To calculate the classifier for a given review, as in the formula explained previously, we must apply the training parameters to the review: that is, match each word in the review being classified with its probability parameters and apply the formula. Note that when we solve the equation, we take the log of both sides because the terms being multiplied are very small.
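The log trick can be sketched in Python with invented probabilities:

```python
from math import log

# summing logs instead of multiplying raw probabilities avoids numeric
# underflow; the per-word probabilities here are invented
word_probs = [0.04, 0.009, 0.035]   # P(word|class) for each word of a review
prior = 0.5                         # P(class)

log_score = log(prior) + sum(log(p) for p in word_probs)
```

Comparing log scores between the two classes gives the same winner as comparing the raw products.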


Step 3: Model Accuracy using Cross-Validation

You will notice there is a Section #3 on the transformation to see how well our classifier did. It turns out that this is not the best way to check accuracy. Instead, we will check the results using cross-validation. When building a model, it is important not to test the model against the training data alone; this causes overfitting, as the model is biased towards the instances it was built upon. Using cross-validation, we re-build the model exactly as before, except with only a randomly sampled subset of the data (say, 75%). We then test the model against the remaining instances to see how well it did. A subset of the predictions from cross-validation, with 4 correct predictions and 1 incorrect prediction, can be seen here:
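The random 75/25 split can be sketched in Python (the integers stand in for labeled reviews):

```python
import random

# hold out a random 25% of the data for testing
random.seed(42)
reviews = list(range(100))
random.shuffle(reviews)
cut = int(len(reviews) * 0.75)
train, test = reviews[:cut], reviews[cut:]
```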



Ultimately, using cross-validation, the model made the correct prediction 88% of the time.


Test the Naïve Bayes Model on Tweets using Pentaho Data Integration and R

To read tweets using R, we make use of two R libraries, twitteR and ROAuth. A more detailed explanation of how to create a Twitter application can be found here.


This allows for a stream of tweets using the R Script Executor in PDI. We will test the model using Jumanji: Welcome to the Jungle, the movie leading the box office on MLK Jr. Day weekend. Using the following code, we can search for recent tweets on a given subject. The twitteR package allows us to specify options such as ignoring retweets and using only tweets in English.


tweetStream = searchTwitter('Jumanji', lang='en', n=100)
dat ="rbind", lapply(tweetStream, as.data.frame))
dat = dat[dat$isRetweet == FALSE,]
review = dat$text
Encoding(review) = "UTF-8"
review = iconv(review, "UTF-8", "UTF-8", sub='') ## remove any non-UTF-8 characters
review = gsub("[\r\n;]", "", review)


Here is a sample of the incoming tweets:



Clearly, these tweets are not in the same format as the training data of old movie reviews. To overcome this, we can remove all @mentions; most of these are unlikely to affect sentiment and are not present in the training data. We can also remove all special characters, which treats hashtags as regular words. Additionally, we remove all http links within a tweet. To keep only tweets that are likely to reveal sentiment, we will only test tweets with 5+ words.
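A rough Python sketch of these cleanup rules (the sample tweet is invented):

```python
import re

def clean_tweet(text):
    text = re.sub(r"@\w+", "", text)     # remove @mentions
    text = re.sub(r"http\S+", "", text)  # remove links
    text = re.sub(r"[^\w\s]", "", text)  # remove special characters (hashtags
                                         # become regular words)
    return text.lower().split()

words = clean_tweet("@TheRock Loved #Jumanji such an amazing movie!")
keep = len(words) >= 5                   # only test tweets with 5+ words
```

Order matters here: links are stripped before special characters, otherwise the punctuation pass would mangle the URLs first.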


To get predictions, we now follow the same process as before, joining the individual words of a tweet to the training parameters and solving the classifier. Here is a sample of the results, along with my own subjective classification:


Predicted Class / Subjective Class

  • I have to admit this was a fun movie to watch jumanji jumanjiwelcometothejungle action httpstcopXCGbOgNGf
  • Jumanji 2 was trash Im warning you before you spend your money to go see it If you remember the first you wont httpstcoV4TfNPHGpC
  • @TheRock @ColinHanks Well the people who have not seen JUMANJI are just wrong so
  • Finally managed to watch Jumanji today Melampau tak kalau aku cakap it was the best movie I have ever watched in my life
  • Is Jumanji Welcome to the Jungle just another nostalgia ploy for money Probably httpstcoDrfOEyeEW2 httpstcoRsfv7Q5mnH
  • Saw Jumanji today with my bro such an amazing movie I really loved it cant wait to see more of your work @TheRock
  • Jumanji Welcome to the Jungle reigns over MLK weekend httpstcoOL3l6YyMmt httpstcoLjOzIa4rhD

One of the major issues with grabbing tweets based on a simple keyword is that many tweets do not reveal sentiment. Of the 51 tweets that were tested (the other 49 were either retweets or did not contain at least 5 words), I subjectively determined that only 22 of them contained sentiment. The successful classification rate on these tweets is 68%. This is significantly less than the success rate in cross-validation, but can be explained by the different use of language between the training set and the tweets: the slang, acronyms, and pop-culture phrasing used on Twitter are not prevalent in the movie review training data from 2005.


Enhancing the Model with Weka

The Naïve Bayes model can be greatly enhanced using Weka. Weka provides powerful features that can be applied within a simple interface and fewer steps. Using its pre-built classifiers, the parameters can be easily tuned; here, Multinomial Naïve Bayes is used. First, the reviews are split by word, as required by Naïve Bayes, using the StringToWordVector filter. Additionally, 10-fold cross-validation is used: instead of building the model once as we did before, the data is randomly partitioned into 10 sets. The model is built 10 times, leaving one set out each time, and the ten models are then averaged to build the classifier. This reduces overfitting, making the model more robust to the tweets.
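The 10-fold partitioning scheme can be sketched in Python (the integers stand in for labeled reviews):

```python
import random

# every instance lands in exactly one fold, so each is held out exactly once
random.seed(0)
data = list(range(50))
random.shuffle(data)
folds = [data[i::10] for i in range(10)]

for i, held_out in enumerate(folds):
    train = [x for j, fold in enumerate(folds) if j != i for x in fold]
    # build the model on `train`, evaluate on `held_out`, then average the
    # ten evaluation results to report overall accuracy
```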


Here is the output from the model:



When the tweets are scored using the PDI Weka scoring step, the subjective successful prediction rate increased slightly to 73%.

This article was co-authored with Benjamin Webb.


Foundational to any map, whether it be a globe, GPS, or any online map, is the ability to understand data on specific locations. The ability to plot geospatial data is powerful, as it allows one to distinguish, aggregate, and display information in a very familiar manner. Using Pentaho, one can use shapefiles to plot areas on a map within a dashboard and then explore data geospatially. In this example, we use C*Tools and Pentaho Data Integration to examine the geographic spread of crime in the city of Chicago.


Getting and Preparing Shapefiles


There are many popular shapefile formats; the most widely used is the format developed and regulated by ESRI. Many geographic analytics packages produce this format, so it is relatively easy to find shapefiles for common geographic boundaries, including country, state/province, county, postal code, political boundaries, and more. For this analysis, we’ll use data provided by the Chicago Data Portal.


First, to get the shapefiles, we will download the Community Area Boundaries datafile in ESRI format. To use it in Pentaho, we will prepare the shapefile by converting it from ESRI to GeoJSON, using a command-line tool provided by GDAL called ogr2ogr. More information on this suite of tools can be found on their website and on GitHub. To execute the tool, we can use the following PDI transformation, which calls ogr2ogr.exe with parameters including GeoJSON as the destination filetype as well as the source and destination files.
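For reference, here is the kind of command the transformation shells out to, sketched from Python with illustrative file names (GDAL must be installed for the real call to succeed, so the invocation is left commented out):

```python
import subprocess

# ogr2ogr -f <format> <destination> <source>
cmd = ["ogr2ogr", "-f", "GeoJSON",
       "community-areas.geojson",   # destination file
       "community-areas.shp"]       # source file
# subprocess.run(cmd, check=True)   # uncomment when GDAL is available
```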

From this process, important information is collected on the 76 community areas in Chicago. As seen below in a small sample of the GeoJSON file created, information is contained for the community of Douglas including a series of latitudes and longitudes representing the points that form a polygon.



{
"type": "FeatureCollection",
"name": "geo_export_80095076-0a6b-4028-a365-64ec9f0350d7",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "perimeter": 0.0, "community": "DOUGLAS", "shape_len": 31027.0545098, "shape_area": 46004621.158100002, "area": 0.0, "comarea": 0.0, "area_numbe": "35", "area_num_1": "35", "comarea_id": 0.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -87.609140876178913, 41.84469250265397 ], [ -87.609148747578061, 41.844661598424025 ], [ -87.609161120412566, 41.844589611939533 ]…


Next, we can get crime data from the same site. This very large file contains diverse information on every crime occurring in Chicago since 2001, including the location, type of crime, date, community ID, etc. To reduce the size of this file of over 6 million rows, we first run a quick PDI transformation that groups the crimes by year and community.
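The aggregation can be sketched in Python with a few invented rows (the real file has millions):

```python
from collections import Counter

# collapse per-incident records into (year, community area) counts
crimes = [
    {"year": 2015, "community": "25"},
    {"year": 2015, "community": "25"},
    {"year": 2015, "community": "35"},
]
counts = Counter((c["year"], c["community"]) for c in crimes)
```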


We need to join the crime dataset to the GeoJSON dataset we created, as only that dataset contains the names of the community areas. Both datasets share a common geographic code, allowing us to blend the data by community area ID.
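The blend can be sketched in Python; the names come from the GeoJSON area_num_1 property, and the counts here are invented:

```python
# attach community names to the aggregated crime counts via the shared area ID
names = {"35": "DOUGLAS", "25": "AUSTIN"}
crime_counts = {"35": 120, "25": 450}

joined = {cid: {"community": names[cid], "crimes": n}
          for cid, n in crime_counts.items() if cid in names}
```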


For this C*Tools dashboard, we will use another PDI transformation as the source of the data. As will be seen later, when the user selects a year on the dashboard, this transformation will be triggered to get the total number of crimes for every Chicago community area in a given year.  We will also get the max number of crimes for all neighborhoods, to be used later in the color gradient.


With this transformation, we now have a set of data that includes key metrics along with their corresponding geographic areas represented as polygons.


Create the Dashboard with NewMapComponent


Now to create the actual map in Pentaho, we will use the C*Tools Community Dashboard Editor (CDE). This will be done using the NewMapComponent under the Components View.


This powerful component utilizes a mapping engine that renders the base world map for free (choosing between the OpenLayers and Google engines). Then GeoJSON or KML is used to plot polygons on top of the map; in our case, the GeoJSON file outlining the communities throughout Chicago will be mapped.


Much of the functionality will be used through JavaScript snippets in the Component Lifecycle steps, such as Pre-Execution and Post-Execution.


As a brief example here, we can load our GeoJSON file with the following JavaScript in our Pre-Execution stage:


// locate our GeoJSON file under resources. Note, the BA server does not
// recognize the .geojson extension, so we rename to .js
var getResource = this.dashboard.getWebAppPath() + '/plugin/pentaho-cdf-dd/api/resources';
var mapDef = '${solution:resources/geojson/community-areas-current-geojson.js}';

// here we pass in our GeoJSON file, and also specify the GeoJSON property
// to use as a polygon ID
this.shapeResolver = 'geoJSON';
this.setAddInOptions('ShapeResolver', 'geoJSON', {
     url: getResource + mapDef,
     idPropertyName: 'area_num_1'
});


This will plot the polygons, with one caveat: the NewMapComponent also takes a data source as input (see the properties tab). This data source must contain data that matches the IDs specified above in the GeoJSON file, and only those polygons for which data points with matching IDs exist will be rendered.


We can specify which columns from our data source to use as the ID in a snippet also in the Pre-Execution phase like so:

this.visualRoles = {
        id: 0,                            // communityarea
        fill: 1                           // crimes
};

Note, here we defined the column for id as the first column (index 0), and use the 2nd column (index 1) as the fill value (more below).


To load our pre-aggregated data and render it on our map, we use the Kettle transformation described above, which takes a year parameter and then reads and filters the Aggregated-Crime-Counts.csv file.


This transformation ensures that the 1st column is the Community Area ID and the 2nd column is the # of crimes, to match our JavaScript above.


Finally, more JavaScript can be added for additional dashboard features. For our heat map example, we want to vary the fill color based on the number of crimes.


We've already linked the data with the code snippet above. The NewMapComponent has some defaults, but to ensure it works smoothly we can implement the fill function ourselves as follows, which is also implemented in the Pre Execution step:

// define the polygon's fill function manually based on the crimes/fill
// value incoming from the datasource
this.attributeMapping.fill = function(context, seriesRoot, mapping, row) {
        var value = row[mapping.fill];
        var maxValue = row[mapping.max];
        if (_.isNumber(value)) {
                 return this.mapColor(value,
                         0,                     // min crimes
                         maxValue,              // max crimes from dataset in yr
                         this.getColorMap()     // a default color map of green->red
                 );
        }
};

The above function validates the incoming fill/crimes column, then maps all values onto the default color map (green to red) on a scale from 0 to the max number of crimes in a year (in 2015 this number was 17,336, coming from the western community of Austin, seen below). All values in between fall somewhere on the gradient.
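The gradient logic can be sketched in Python; the green and red RGB endpoints here are illustrative, not the component's exact defaults:

```python
# linearly interpolate between two endpoint colors on a 0..max scale
def map_color(value, vmin, vmax):
    t = (value - vmin) / (vmax - vmin)
    green, red = (0, 128, 0), (255, 0, 0)
    return tuple(round(g + t * (r - g)) for g, r in zip(green, red))

low = map_color(0, 0, 17336)        # minimum crimes: pure green
high = map_color(17336, 0, 17336)   # 2015 maximum: pure red
```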



Another very useful feature that can be implemented within the NewMapComponent is the tooltip, which highlights information about a community area, displaying the community name and the total number of crimes when hovered over with the mouse. This is implemented in the Post-Execution step, again using JavaScript.


function tooltipOnHover() {
    var me = this;
    /** Define events for mouse move **/
    /* … */.on('mousemove', function (e) {
        if (!_.isEmpty(me.currentFeatureOver)) {
            var modelItem = me.mapEngine.model.findWhere({ /* … */ });
            // position the tooltip near the cursor:
            //   .css('top', e.pageY - 50)
            //   .css('left', e.pageX + 5)
            // html contained in popup:
            //   '<br>Total Crimes: ' + /* … */ + ''
        }
    });
    /* … */.on('movestart', function (e) { /* … */ });
    /* … */.on('moveend', function (e) { /* … */ });
}




Effectively analyzing geospatial data, especially with enhancement tools like the NewMapComponent, can be very powerful. From this basic example, we can better understand how crime is distributed across a very large city and how that distribution has changed over time. Using polygons allows us to better group the data in order to gain valuable insight.


This approach is heavily indebted to Kleyson Rios's NMC-samples repository, which has similar examples and can also be zipped & uploaded for exploring the NewMapComponent.


The code for this example can be found on GitHub.