Predicting Customer Sentiment

It wasn’t too long ago that sentiment analysis was simply a matter of parsing the words from a text and matching them against a lexicon of positive or negative words. Although lexicon matching is a useful way of quantifying sentiment, it does not by itself predict sentiment. In this post we will parse the words from a set of tweets and convert them into variables that can be used in a predictive model.

First we will import a CSV file containing 15,000 tweets about various airlines. These tweets have been manually categorized as positive or negative.

Example of CSV

After importing, we convert the file to a data frame called df.
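The post’s import code isn’t shown, so here is a minimal sketch of that step. The file name and the column names ("text", "sentiment") are assumptions, not from the original; a tiny inline stand-in is used so the snippet runs on its own.

```r
# In the post, df comes from a CSV of 15,000 labeled tweets, e.g.:
# df <- read.csv("airline_tweets.csv", stringsAsFactors = FALSE)
# Here we build a tiny stand-in with the same assumed columns.
df <- data.frame(
  text = c("@United thanks for the great flight!",
           "Flight delayed 3 hours, no updates"),
  sentiment = c("positive", "negative"),
  stringsAsFactors = FALSE
)
str(df)  # one column of tweet text, one of positive/negative labels
```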

The tm library in R is fantastic. Pretty much anything you want to do with text analysis can be found here: converting text to a corpus, removing punctuation, stripping out numbers, dropping specific words, and so on. At this stage we remove stop words (a, the, at, or, and, etc.) along with certain words, such as the airline name “United”, that we don’t want in the model because they will not help with the prediction.
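The cleaning code itself was an image in the original post, so below is a sketch of the typical tm pipeline for these steps. The exact transformations and the list of airline names to drop are assumptions; the sample tweets are stand-ins.

```r
library(tm)

# Tiny stand-in for the tweet text column (the post uses 15,000 tweets)
tweets <- c("@United thanks for the GREAT flight!!",
            "Flight delayed 3 hours, no updates at all")

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Drop airline names, which carry no predictive signal
corpus <- tm_map(corpus, removeWords, c("united", "american", "delta"))
corpus <- tm_map(corpus, stripWhitespace)

content(corpus[[1]])  # cleaned text of the first tweet
```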

After this we put the corpus into a Document Term Matrix (DTM): a matrix with one row per tweet and one column per term, whose cells count how often each term appears. The removeSparseTerms function lets us keep only the more common terms; you can tune the threshold by trial and error. In our case, a threshold of 0.99 gives us a result of 153 words.
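A minimal sketch of this step, using a three-document stand-in corpus rather than the post’s cleaned tweets:

```r
library(tm)

corpus <- VCorpus(VectorSource(c("great flight crew",
                                 "flight delayed again",
                                 "crew was great")))
dtm <- DocumentTermMatrix(corpus)

# Drop rare terms: 0.99 means a term may be missing from at most
# 99% of documents. On the post's full data this left 153 words.
dtm <- removeSparseTerms(dtm, 0.99)
dim(dtm)  # documents x terms
```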

Along with common individual words, there are common combinations of two, three, and four words, known as n-grams. Just as with the single words, we want to tune the removeSparseTerms threshold on an iterative basis.
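The post doesn’t show its tokenizer, but one common approach is the ngrams helper from the NLP package (which tm builds on) passed to DocumentTermMatrix. A bigram sketch, with stand-in text; swap the 2 for 3 or 4 to build the longer n-gram matrices the same way:

```r
library(tm)
library(NLP)

# Tokenizer that emits two-word phrases instead of single words
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)

corpus <- VCorpus(VectorSource(c("flight was delayed again",
                                 "the flight was great")))
dtm2 <- DocumentTermMatrix(corpus,
                           control = list(tokenize = BigramTokenizer))
# As with single words, tune the sparsity threshold iteratively
dtm2 <- removeSparseTerms(dtm2, 0.999)
Terms(dtm2)
</code stays self-contained>
```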

As you can see in the results above, there are 19 two-word combinations, 32 three-word combinations, and 4 four-word combinations. Below is what some of them look like.

These are all combined into one data frame. Before we are ready to start the modeling, we need to handle one additional word, “next”: because next is a reserved word in R, it would throw off the model formula, so we rename the column “nextt”.
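A sketch of the combine-and-rename step. The term columns here are a small stand-in; in the post the frame comes from converting the unigram and n-gram matrices with as.data.frame(as.matrix(...)) and binding them together.

```r
# Stand-in for the combined term counts from the unigram and n-gram DTMs
all_terms <- data.frame(great = c(1, 0, 1),
                        delayed = c(0, 1, 0),
                        check.names = FALSE)
all_terms[["next"]] <- c(0, 0, 1)  # a term column literally named "next"

# "next" is a reserved word in R and breaks the model formula,
# so rename the column to "nextt"
names(all_terms)[names(all_terms) == "next"] <- "nextt"
names(all_terms)
```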

Now it’s time to create training and validation sets, using a 60/40 split. We will start with just the positive tweets for now. The as.formula function allows us to build the model formula without having to manually enter each independent variable.
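A sketch of the split and the formula construction. The seed, the label column name ("positive"), and the stand-in term columns are assumptions:

```r
set.seed(42)  # assumed seed, for reproducibility
n <- 100

# Stand-in modeling frame: term counts plus the positive/negative label
all_terms <- data.frame(great = rbinom(n, 1, 0.3),
                        delayed = rbinom(n, 1, 0.3))
all_terms$positive <- rbinom(n, 1, 0.5)

# 60/40 training/validation split
train_idx <- sample(n, size = 0.6 * n)
train <- all_terms[train_idx, ]
valid <- all_terms[-train_idx, ]

# Build the formula from the column names instead of typing every term
predictors <- setdiff(names(all_terms), "positive")
f <- as.formula(paste("positive ~", paste(predictors, collapse = " + ")))
f  # positive ~ great + delayed
```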

Finally we’ll run the logistic regression model, generate the ROC curve output, and bucket the predictions into deciles.
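A self-contained sketch of these final steps on synthetic stand-in data (the post uses the tweet term matrix). The post doesn’t name its ROC library; pROC is one option, shown commented out. The decile bucketing here uses base R quantiles rather than whatever the original code used.

```r
set.seed(1)  # synthetic stand-in data
n <- 500
d <- data.frame(great = rbinom(n, 1, 0.3),
                thanks = rbinom(n, 1, 0.3),
                delayed = rbinom(n, 1, 0.3))
d$positive <- rbinom(n, 1,
                     plogis(-1 + 2 * d$great + 2 * d$thanks - 2 * d$delayed))
train <- d[1:300, ]
valid <- d[301:500, ]

# Logistic regression on the training set
model <- glm(positive ~ great + thanks + delayed,
             data = train, family = binomial)

# Score the validation set
valid$pred <- predict(model, newdata = valid, type = "response")

# ROC curve -- the pROC package is one option:
# library(pROC); plot(roc(valid$positive, valid$pred))

# Bucket predictions into deciles (1 = highest predicted probability)
valid$decile <- 11 - findInterval(valid$pred,
                                  quantile(valid$pred, probs = seq(0, 1, 0.1)),
                                  rightmost.closed = TRUE)

# Share of actual positives per decile, as in the validation output
tapply(valid$positive, valid$decile, mean)
```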

Now we can look at the CSV output for our validation set. As you can see below, the model does a pretty good job of predicting whether a tweet is positive: in the top decile, it is correct 74% of the time.