The results of the analysis from the last post come from the dataset. Now, my goal is to have these statistics updated with every new tweet, or every hour. So, first of all, it’s necessary to train a classifier that can label new tweets as positive or negative.
The classifier chosen is a Naive Bayes classifier, one of the simplest supervised machine learning methods, and its results are acceptable.
The steps used to train the classifier are the following:
- Load NLTK with the punkt tokenizer
- Define a set of useless words (NLTK stopwords plus string punctuation) so the tweets are tokenized correctly
- Tokenize the negative and the positive tweets and define the features
- Train a Naive Bayes sentiment classifier on a training set
- Test the classifier and measure its accuracy
As I wrote in the previous post, NLTK is a very powerful library, so creating the classifier is easy. The steps, translated into code, are the following:
As we can see, the accuracy of the classifier is about 86%; this result is limited by the small number of tweets used for training. I used only 2000 tweets because this dataset contains only about 2400 positive tweets. There are many more negative tweets, but I can’t use all of them: the classifier must be trained with the same number of positive and negative tweets so as not to introduce a bias.
The last step of this work is to use the negative tweets to train a classifier for the reason behind each negative tweet. The steps to define this classifier are the same as for the first one, but now the subset is only the negative tweets.
In this work I try to classify only the first two causes of negative tweets, because the other causes don’t have enough data. All the remaining tweets are classified as “others”.
This classifier reaches an accuracy of 69%, which is not great; but only 1000 tweets were used for training and 400 to verify the classifier.
The first classifier, with an accuracy of 86%, is acceptable, since human accuracy on this task is about 80%. The accuracy of the second one, however, is too low to be acceptable.
If we consider the combined accuracy (first classifying tweets as negative or positive, then classifying the negative ones by reason), it drops to 59%.
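That combined figure follows directly from the two stages: a negative tweet is handled correctly only if both classifiers get it right, so (assuming the two errors are independent) the accuracies multiply:

```python
# Stage 1: positive vs. negative; stage 2: reason of the negative tweet
stage1_accuracy = 0.86
stage2_accuracy = 0.69

# A negative tweet is routed correctly only if both stages are correct
combined_accuracy = stage1_accuracy * stage2_accuracy
print(round(combined_accuracy, 2))  # 0.59
```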
The principal reason for these results is that the dataset has few positive tweets and few tweets for each cause of a bad flight. It cannot be ruled out that a more complex classifier would reach better results.