Twitter Sentiment Analysis

Given a corpus of labeled tweets, how do you train a model to accurately predict the sentiment of new tweets? This was one of the projects from my Machine Learning class at USF.

Here are three different tweets and their labels in the training set:

negative  “didn’t get shit done today ~ i’m so screwed “
neutral  “Check this video out — President Obama at the White House Correspondents’ Dinner http://bit.ly/IMXUM”
positive  “Absolutly lovely day. Sunshine and everything… “

The data set contained around 8,000 positive tweets, 8,000 negative tweets, and just 200 neutral tweets. For proprietary reasons, I can’t share the data set of tweets I had for this project, but I will share my code and describe my approach below.

Baseline Model

You can build a baseline model that uses only the rates of certain English function words such as “I”, “the”, “and”, “to”, and “you”. My baseline model also included the rates of the following punctuation symbols in the tweet: “.”, “,”, and “!”. The rate of a word is the number of times it occurs divided by the number of words in the text. For example, the rate of “the” in the phrase “the quick brown fox jumped over the lazy dog” is 2 / 9 = 0.2222.
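The rate calculation can be sketched in a few lines of Python. This is only an illustration, not the code from the project: the feature list contains just the examples named above, the helper name `feature_rates` is made up, and dividing punctuation counts by the number of words is an assumption.

```python
# Hypothetical feature list: only the examples named in the text.
FEATURES = ["i", "the", "and", "to", "you", ".", ",", "!"]
PUNCTUATION = {".", ",", "!"}

def feature_rates(text):
    """Return the rate of each function word and punctuation mark in text."""
    spaced = text.lower()
    # Split the punctuation marks off as separate tokens.
    for mark in PUNCTUATION:
        spaced = spaced.replace(mark, f" {mark} ")
    tokens = spaced.split()
    # Assumption: the denominator is the number of words, excluding punctuation.
    words = [t for t in tokens if t not in PUNCTUATION]
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    return {f: tokens.count(f) / n_words for f in FEATURES}

rates = feature_rates("the quick brown fox jumped over the lazy dog")
print(round(rates["the"], 4))  # 0.2222
```

Each tweet then becomes one row of eight numbers, which is all the baseline classifiers get to see.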

Once these rates were calculated, I split the data into a training set and a test set. I fit three models on the training set: logistic regression, linear discriminant analysis (LDA), and k-nearest neighbours. Each model was then evaluated on the held-out test set, and the misclassification rate (the fraction of test tweets labeled incorrectly) was used to compare them:

[Table 1: misclassification rates of logistic regression, LDA, and k-nearest neighbours on the test set]
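The split-and-compare procedure could look roughly like the sketch below, assuming scikit-learn. Since the tweets can't be shared, the features here are synthetic stand-ins generated with `make_classification`, so the printed numbers say nothing about the real results; the split size, `n_neighbors`, and `max_iter` values are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the rate features: 8 columns, 3 sentiment classes.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

# Hold out a quarter of the rows for testing (the split size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

error_rates = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Misclassification rate = 1 - accuracy on the held-out test set.
    error_rates[name] = 1.0 - model.score(X_test, y_test)

for name, rate in error_rates.items():
    print(f"{name}: {rate:.3f}")
```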

LDA performs best here, with a misclassification rate of 0.41. So how do we do better?

Improved Model

A bag-of-words model is a simple and common approach to document classification in which the frequency of each word is used as a feature for training classifiers. This approach represents each document (here, each tweet) as a bag of its words, disregarding grammar and even word order but keeping multiplicity.
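As a toy illustration of the idea (not the project code), Python's `collections.Counter` already behaves like a bag of words. The helper name `bag_of_words` and the plain whitespace split are simplifications made for this sketch.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs, ignoring order."""
    return Counter(text.lower().split())

a = bag_of_words("good day good mood")
b = bag_of_words("mood good day good")

print(a["good"])  # 2: multiplicity is kept
print(a == b)     # True: word order is discarded
```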

In the baseline model above we were using only the frequencies of English function words, but as you will see below, we can do much better. To use a bag-of-words classifier, I first built a corpus using all the tweets available. After tokenizing each tweet using a tweet-specific tokenizer that captures emoticons, I used a stemmer and reduced words to their lexical roots.

Through research and experimentation, I found that the best stemmer for tweets was the “SnowballStemmer” (compared with the Porter stemmer and the Lancaster stemmer in the nltk package). This stemmer preserves emoticons as they are. After stemming, I built a corpus of the roots and emoticons and used them as input variables for the classifiers.
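A minimal sketch of the tokenize-then-stem step is shown below. It assumes the tweet-specific tokenizer is nltk's `TweetTokenizer` and that the stemmer is configured for English; neither detail is stated above, and the helper name `tokenize_and_stem` is made up for illustration.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer

# Assumptions: nltk's TweetTokenizer and the English Snowball stemmer.
tokenizer = TweetTokenizer()
stemmer = SnowballStemmer("english")

def tokenize_and_stem(tweet):
    """Split a tweet into tokens (keeping emoticons whole) and stem each one."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(tweet)]

stems = tokenize_and_stem("Running late again :)")
print(stems)
```

The emoticon survives as a single token, so it can become a feature in its own right.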

To further improve the model, we can also incorporate the order of words in the tweets. Word order matters: tweets that begin with “I like” are probably positive, while tweets that begin with “I don’t like” are probably negative. A simple way to capture this is to use bigrams and trigrams as features. In the code, this is done by setting ngram_range=(1, 3) in the vectorizer.
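To see what that setting produces, here is a small sketch that assumes the vectorizer is scikit-learn's `CountVectorizer` (a `TfidfVectorizer` accepts the same parameter). The two example sentences and the `token_pattern` override are choices made for this illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# token_pattern=r"\S+" keeps one-letter words such as "i",
# which the default pattern would drop.
vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
vectorizer.fit(["I like mondays", "I don't like mondays"])

# The vocabulary now holds unigrams, bigrams, and trigrams.
for ngram in sorted(vectorizer.vocabulary_):
    print(ngram)
```

Both “i like” and “i don't like” end up as separate features, which is what lets a linear classifier give them opposite weights.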

The feature additions over the baseline can be summarized as follows:

  • Stemming words to their lexical roots and using the same common root as a feature
  • Preserving and recognizing emoticons using a tweet-specific tokenizer
  • Considering bigrams and trigrams in addition to unigrams as features

In our final model, a tweet’s sentiment tag depends on which words appear in the tweet and on their order (through the use of bigrams and trigrams).

[Table 2: misclassification rates of the improved model]

These improvements bring down the misclassification rate from 0.41 to 0.22 when using logistic regression!

Further Improvements

I would explore part-of-speech tagging to further improve performance. Using Doc2Vec to encode documents (tweets) as vectors would also allow efficient feature creation.

Code

You can find my code here.
