I want to explore some concepts of sentiment analysis and try some libraries that help with data analysis and sentiment analysis. The dataset used is “Twitter US Airline Sentiment”, which is easy to find on Kaggle: https://www.kaggle.com/crowdflower/twitter-airline-sentiment
This dataset is a good fit: it contains tweets about US airlines from February 2015, each classified as positive, negative or neutral. The negative tweets are also classified by the reason for the complaint.
This post is divided into two parts:
- First part: data analysis on the dataset to find the best and the worst airlines and to understand the most common problems behind a bad flight.
- Second part: training two Naive Bayes classifiers. The first classifies tweets as positive or negative, and the second classifies the negative tweets by reason. This makes it possible to update the statistics from the first part with new tweets.
This work can help an airline understand which problems it needs to work on.
For this work I’ll use Python with Pandas, Jupyter Notebook and NLTK.
Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. I’ll use it to load the dataset and run the analysis.
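As a rough illustration of the loading step, here is a minimal sketch. The column names (`airline_sentiment`, `negativereason`, `airline`, `text`) and the file name `Tweets.csv` are assumptions about the Kaggle download, and the three rows below are invented stand-ins so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the Kaggle CSV (invented rows, assumed column names).
SAMPLE_CSV = io.StringIO(
    "airline_sentiment,negativereason,airline,text\n"
    "negative,Late Flight,United,@united my flight is two hours late\n"
    "positive,,Delta,@delta great crew today\n"
    "neutral,,United,@united what time does check-in open?\n"
)

# With the real dataset this would be something like pd.read_csv("Tweets.csv")
# (file name assumed).
tweets = pd.read_csv(SAMPLE_CSV)

# Quick look at the size of the table and the sentiment distribution.
print(tweets.shape)
print(tweets["airline_sentiment"].value_counts())
```

Empty `negativereason` cells are read as missing values, which is convenient later when only the negative tweets carry a reason.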
Jupyter Notebook is very useful for data scientists because it is a web application for creating and sharing documents that contain live code, equations, visualizations and explanatory text. This makes sharing work very easy.
NLTK is a platform for building Python programs that work with human language data. With this library it is simple to train a classifier for computational linguistics tasks.
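To give an idea of what training a classifier with NLTK looks like, here is a minimal sketch using `nltk.NaiveBayesClassifier`. The six training sentences and the `word_features` helper are invented for illustration; the actual feature extraction and training data for this project may well differ.

```python
import nltk


def word_features(text):
    """Hypothetical bag-of-words features: each lowercased token maps to True."""
    return {word: True for word in text.lower().split()}


# Invented labelled examples, not taken from the dataset.
train = [
    ("my flight was cancelled and nobody helped", "negative"),
    ("lost my bag again terrible service", "negative"),
    ("worst delay ever", "negative"),
    ("great crew and smooth flight", "positive"),
    ("thanks for the quick help great service", "positive"),
    ("love the new seats", "positive"),
]

# NLTK expects a list of (feature dict, label) pairs.
train_set = [(word_features(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new, unseen sentence.
label = classifier.classify(word_features("terrible delay and lost bag"))
print(label)
```

Splitting on whitespace is the simplest possible tokenizer; a real pipeline would probably also strip mentions, URLs and stop words.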
OK, let’s start with the data analysis. Looking at the dataset, we find a lot of columns, but the most important are:
This dataset doesn’t need any cleaning, but the questions I want to answer require some transformations. First, the tweets must be divided into three groups: positive, negative and neutral. Then I want to find the best and the worst airline for February 2015. After that I want to find the common problems behind a bad flight, and for this I consider only the negative tweets (obviously!).
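These transformations can be sketched in a few lines of Pandas. The column names `airline` and `airline_sentiment` are assumptions about the dataset, and the seven rows are invented sample data. Like the analysis in this post, the sketch ranks airlines by raw number of tweets rather than by share.

```python
import pandas as pd

# Invented sample rows; the column names are assumed to match the dataset.
tweets = pd.DataFrame({
    "airline": ["United", "United", "United", "Delta", "Delta", "Delta", "Virgin America"],
    "airline_sentiment": ["negative", "negative", "neutral",
                          "positive", "positive", "negative", "positive"],
})

# Divide the tweets into the three sentiment groups.
positive = tweets[tweets["airline_sentiment"] == "positive"]
negative = tweets[tweets["airline_sentiment"] == "negative"]
neutral = tweets[tweets["airline_sentiment"] == "neutral"]

# Count tweets per airline within each group.
negative_per_airline = negative["airline"].value_counts()
positive_per_airline = positive["airline"].value_counts()

# Worst airline: most negative tweets. Best airline: most positive tweets.
worst = negative_per_airline.idxmax()
best = positive_per_airline.idxmax()
print("worst:", worst, "| best:", best)
```

Raw counts favour airlines that simply receive more tweets, so dividing by each airline’s total number of tweets would be a reasonable variation.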
The result for the worst airline is the following:
Num of tweets
And for the best is:
Num of tweets
From the dataset, the ranking of the bad flight problems is:
|Negative reason|Num of tweets|
|---|---|
|Customer Service Issue|2910|
|Flight Attendant Complaints|481|
|Flight Booking Problems|529|
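A ranking like this one can be obtained by counting the reasons attached to the negative tweets. The sketch below assumes a `negativereason` column and uses invented rows, so its counts are not the real figures.

```python
import pandas as pd

# Invented sample rows; the column names are assumed to match the dataset.
tweets = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "negative", "positive", "neutral"],
    "negativereason": ["Customer Service Issue", "Late Flight",
                       "Customer Service Issue", None, None],
})

# Keep only the negative tweets, then count how often each reason appears.
negative = tweets[tweets["airline_sentiment"] == "negative"]
reason_rank = negative["negativereason"].value_counts()
print(reason_rank)
```

`value_counts` already sorts in descending order, so the first entry is the most common reason.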
So the most common reason for a bad flight is customer service… not so good for the companies…
In the next post I’ll train the Naive Bayes classifiers to analyze new tweets.