One of the most heavily discussed topics in machine learning and data mining today is sentiment analysis. For the uninitiated, sentiment analysis is a goal to classify text as positive or negative based only on previously classified text. In this article, I will attempt to classify the sentiment of Twitter comments about a certain movie, based only on a dataset of 10,662 movie reviews, released in 2005. This solution will be demonstrated using 2 methods—once using only Pentaho Data Integration (with some R), and a more sophisticated solution will be built using Weka.

# Understanding the Naïve Bayes Classifier

Although many machine learning algorithms become very complex and difficult to understand very quickly, the Naïve Bayes classifier relies on one of the most fundamental rules in statistics, allowing its results to be highly interpretable, while also maintaining a high degree of predictive power. It is based upon Bayes’ Rule, which can be used to predict conditional probability. The equation reads:

*P("I"|negative)*can be described as the total number of times “I” appears in negative reviews, divided by the total number of words in negative reviews*P(negative)*is the total number of words that are in negative reviews divided by the total number of words in the training data*P("I")*is the total number of times “I” occurs in all reviews divided by the total number of words in the training data

We can then do the same above equation and replace the occurrences of negative with positive. Whichever probability is higher allows us to predict a movie review’s sentiment as negative or positive. The expectation would be that hated occurs significantly more often in the negative reviews, with the other terms being similar in both classes, thus allowing us to correctly classify this review as negative.

# Build a Naïve Bayes Model using Pentaho Data Integration

To build the model in Pentaho, there are a few steps involved. First, we must prepare the data by cleaning the data. Once this is done, we then build the terms for each word in the classifier. Lastly, we test the performance of the model using cross-validation.

## Step 1: Cleaning and Exploring the Source Data

To perform the sentiment analysis, we’ll begin the process with 2 input files—1 for negative reviews and 1 for positive reviews. Here is a sample of the negative reviews:

To clean the data for aggregation, punctuation is removed and words are made lowercase to allow for a table aggregated by class and word. Using the data explorer, we can start to see the word count differences for some descriptive words. These numbers intuitively make sense and help to build a strong classifier.

## Step 2: Building Terms for the Classifier

Next, we build the various terms for the classifier. Using the calculator steps, we need the probabilities and conditional probabilities for each word that occurs either in a negative review or positive review (or both) in the training data. The output from these steps then creates the parameters for the model. These need to be saved, so eventually they can be used against testing data. Here is a sample:

It can be noted that some of these word counts are null (zero). In the training data, this only occurs if a word count is zero for one of the two classes. But in the test data, this can occur for both classes of a give word. You will notice that the conditional probabilities for these null words are nonzero. This is because Add-1 Smoothing is implemented. We “pretend” that this count is 1 when we calculate the classifier, preventing a zero-out of the calculation.

To calculate the classifier for a given instance of a review, like the formula previously explained, we must apply the training parameters to the review—that is match each word in the review being classified with its probability parameters and apply the formula. It can be noted that when we solve the equation, we take the log of both sides because the terms being multiplied are very small.

## Step 3: Model Accuracy using Cross-Validation

You will notice there is a Section #3 on the transformation to see how well our classifier did. It turns out, that this is not the best way to check the accuracy. Instead, we will check the results using cross-validation. When building a model, it important not to test a model against the training data alone. This will cause overfitting, as the model is biased towards the instances it was built upon. Instead, using cross-validation we can re-build the model exactly as before, except only with a randomly sampled subset of the data (say, 75%). We then test the model against the remaining instances to see how well the model did. A subset of the predictions, with 4 correct predictions and 1 incorrect prediction, from cross-validation can be seen here:

Ultimately, using cross-validation, the model made the correct prediction 88% of the time.

# Test the Naïve Bayes Model on Tweets using Pentaho Data Integration and R

To read Tweets using R, we make use of two R libraries, twitteR and ROAuth. A more detailed explanation to create a Twitter application can be found here.

This allows for stream of tweets using the R Script Executor in PDI. We will test the model using Jumanji: Welcome to the Jungle, the movie leading the box office on MLK Jr. Day Weekend. Using the following code, we can search for recent tweets on a given subject. The twitteR package allows us to specify features, like ignoring retweets and using only tweets in English.

tweetStream = searchTwitter(‘Jumanji’ ,lang='en' ,n=100) dat = do.call("rbind", lapply(tweetStream, as.data.frame)) dat = dat[dat$isRetweet==FALSE,] review = dat$text Encoding(review) = "UTF-8" review = iconv(review, "UTF-8", "UTF-8",sub='') ## remove any non UTF char review = gsub("[\r\n;]", "", review)

Here is sample of the incoming tweets:

Clearly, these tweets are not of the same format as the training data of old movie reviews. To overcome this, we can remove all @ mentions. Most of these are unlikely to affect sentiment and are not present in the training data. We can also remove all special characters—this will treat hashtags as regular words. Additionally, we remove all http links within a tweet. To keep only tweets that are likely to reveal sentiment, we will only test tweets with 5+ words.

To get predictions, we now follow the same process as before, joining the individual words of a tweet to the training parameters and solve the classifier. Here is a sample of the results, along with my own subjective classifier:

Predicted Class | Subjective Class | Tweet |

negative | positive | I have to admit this was a fun movie to watch jumanji jumanjiwelcometothejungle action httpstcopXCGbOgNGf |

negative | negative | Jumanji 2 was trash Im warning you before you spend your money to go see it If you remember the first you wont httpstcoV4TfNPHGpC |

negative | positive | @TheRock @ColinHanks Well the people who have not seen JUMANJI are just wrong so |

positive | positive | Finally managed to watch Jumanji today Melampau tak kalau aku cakap it was the best movie I have ever watched in my life |

negative | negative | Is Jumanji Welcome to the Jungle just another nostalgia ploy for money Probably httpstcoDrfOEyeEW2 httpstcoRsfv7Q5mnH |

positive | positive | Saw Jumanji today with my bro such an amazing movie I really loved it cant wait to see more of your work @TheRock |

positive | positive | Jumanji Welcome to the Jungle reigns over MLK weekend httpstcoOL3l6YyMmt httpstcoLjOzIa4rhD |

One of the major issues with grabbing tweets based on a simple keyword is that many tweets do not reveal sentiment. Of the 51 tweets that were tested (the other 49 were either retweets or did not contain 5 words), I subjectively determined only 22 of them contained sentiment. The successful classification rate of these tweets is 68%. This is significantly less than the success rate in the cross-validation, but can be explained by the different use of language between the training set and the tweets. The slang, acronyms and pop culture phrasing used on Twitter is not prevalent in the movie review training data from 2005.

# Enhancing the Model with Weka:

The Naïve Bayes model can be greatly enhanced using Weka. Weka provides powerful features that can be applied within a simple interface and fewer steps. Using their pre-built classifiers, the parameters can be easily tuned. Here, Multinomial Naïve Bayes is used. First, the reviews are split by word, as required by Naïve Bayes, by using the StringToWordVector filter. Additionally, 10-fold cross validation is used. Instead of building the model once as we did before the data is randomly partitioned into 10 sets. The model is run 10 times, leaving 1 set out each time and then the ten models are averaged out to build the classifier. This model will reduce overfitting, making it more robust to the tweets.

Here is the output from the model:

When the tweets are scored using the PDI Weka scoring step, the subjective successful prediction rate increased slightly to 73%.