Given a corpus of tweets, how do you train a model to accurately predict sentiments of new tweets? This was one of the projects from my Machine Learning class at USF.
Here are three different tweets and their labels in the training set:
- negative: “didn’t get shit done today ~ i’m so screwed “
- neutral: “Check this video out — President Obama at the White House Correspondents’ Dinner http://bit.ly/IMXUM”
- positive: “Absolutly lovely day. Sunshine and everything… “
Among the tweets, there were around 8000 positive tweets, 8000 negative tweets, and just 200 neutral tweets. For proprietary reasons, I can’t share the data set of tweets I had for this project, but I will share my code and describe my approach below.
A baseline model can be built using just the rates of certain English function words such as “I”, “the”, “and”, “to”, and “you”. My baseline model also included the rates of the punctuation symbols “.”, “,”, and “!” in each tweet. For example, the rate of “the” in the phrase “the quick brown fox jumped over the lazy dog” is 2/9 ≈ 0.2222.
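A minimal sketch of these baseline features (the function-word list and feature names are illustrative, not my exact code): each feature is a count divided by the number of tokens in the tweet.

```python
FUNCTION_WORDS = ["i", "the", "and", "to", "you"]
PUNCTUATION = [".", ",", "!"]

def baseline_features(tweet):
    """Rates of selected function words and punctuation marks in a tweet."""
    tokens = tweet.lower().split()
    n = len(tokens)
    feats = {}
    for w in FUNCTION_WORDS:
        # word rate = occurrences of the word / total tokens
        feats[w] = tokens.count(w) / n if n else 0.0
    for p in PUNCTUATION:
        # punctuation rate = occurrences of the symbol / total tokens
        feats[p] = tweet.count(p) / n if n else 0.0
    return feats

print(baseline_features("the quick brown fox jumped over the lazy dog"))
```

On the example phrase above, this yields a rate of 2/9 for “the” and 0 for every punctuation symbol.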
Once these rates were calculated, I split the data into a training set and a test set. Models were built on the training set using three different algorithms – Logistic Regression, Linear Discriminant Analysis (LDA), and k-Nearest Neighbours – then evaluated on the holdout set, with the misclassification rate used to compare them:
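A sketch of that comparison using scikit-learn (the 70/30 split, `X`, and `y` are my assumptions standing in for the real feature matrix and labels):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def compare_models(X, y):
    """Fit three classifiers and return their holdout misclassification rates."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "kNN": KNeighborsClassifier(n_neighbors=5),
    }
    rates = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # misclassification rate = 1 - accuracy on the holdout set
        rates[name] = 1.0 - model.score(X_test, y_test)
    return rates
```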
LDA performs best here, with a misclassification rate of 0.41. So how do we do better?
A bag-of-words model is a simple and common approach used for document classification where the frequency of occurrence of each word is used as a feature for training classifiers. This approach represents each sentence as a bag of its words, disregarding grammar and even word order but keeping multiplicity.
In the baseline model above we used only the frequencies of English function words, but as you will see below, we can do much better. To build a bag-of-words classifier, I first constructed a corpus from all the available tweets. After tokenizing each tweet with a tweet-specific tokenizer that captures emoticons, I applied a stemmer to reduce words to their lexical roots.
Through research and experimentation, I found that the best stemmer for tweets was the “SnowballStemmer” (versus the Porter and Lancaster stemmers in the nltk package), since it preserves emoticons as they are. After stemming, I built the corpus from the roots and emoticons and used them as input variables for the classifiers.
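A sketch of the tokenize-then-stem step with nltk, using `TweetTokenizer` for emoticon-aware tokenization and `SnowballStemmer` for the roots (skipping non-alphabetic tokens is my simplification so emoticons and punctuation pass through untouched):

```python
from nltk.tokenize import TweetTokenizer
from nltk.stem.snowball import SnowballStemmer

tokenizer = TweetTokenizer(preserve_case=False)
stemmer = SnowballStemmer("english")

def stem_tweet(tweet):
    """Tokenize a tweet, keeping emoticons, and stem the word tokens."""
    tokens = tokenizer.tokenize(tweet)
    # stem alphabetic tokens; leave emoticons and punctuation as-is
    return [stemmer.stem(t) if t.isalpha() else t for t in tokens]

print(stem_tweet("Absolutely lovely day :) Sunshine and everything!"))
```

Note how “:)” survives as a single token, so it can become a feature in its own right.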
To further improve the model, we can also incorporate the order of words in the tweets. Word order matters: tweets that begin with “I like” are probably positive, while tweets that begin with “I don’t like” are probably negative. A simple way to capture this is to use bigrams and trigrams as features; in the code, this is done by setting ngram_range=(1, 3) in the vectorizer function.
The feature additions over the baseline can be summarized as follows:
- Stemming words to their lexical roots and using the same common root as a feature
- Preserving and recognizing emoticons using a tweet-specific tokenizer
- Considering bigrams and trigrams in addition to unigrams as features
In our final model, a tweet’s sentiment tag depends both on which words appear in the tweet and on their order (through the use of bigrams and trigrams).
These improvements bring down the misclassification rate from 0.41 to 0.22 when using logistic regression!
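Putting the pieces together, here is a minimal end-to-end sketch of the final approach (the pipeline structure, variable names, and 70/30 split are my assumptions, not the original code):

```python
from nltk.tokenize import TweetTokenizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

_tok = TweetTokenizer(preserve_case=False)
_stem = SnowballStemmer("english")

def tokenize_and_stem(tweet):
    # emoticon-aware tokenization, then stemming of alphabetic tokens
    return [_stem.stem(t) if t.isalpha() else t for t in _tok.tokenize(tweet)]

def train(tweets, labels):
    """Fit the n-gram bag-of-words + logistic regression model."""
    pipeline = make_pipeline(
        CountVectorizer(tokenizer=tokenize_and_stem,
                        token_pattern=None, ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    X_train, X_test, y_train, y_test = train_test_split(
        tweets, labels, test_size=0.3, random_state=42)
    pipeline.fit(X_train, y_train)
    print("misclassification rate:", 1.0 - pipeline.score(X_test, y_test))
    return pipeline
```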
To improve performance further, I would explore part-of-speech tagging. Using Doc2Vec to encode documents (tweets) into vectors could also enable more efficient feature creation.
You can find my code here.