Sentiment Analysis on US Twitter Airlines dataset: a deep learning approach

Monte Bianco, Italian Alps 

In two of my previous posts (this and this), I ran a sentiment analysis on the Twitter airline dataset with one of the classic machine learning techniques: Naive-Bayesian classifiers. For this post I built a classifier with a deep learning approach. This work won’t be seminal; it’s only an excuse to play a little with neural networks.

For this work I used TensorFlow and Keras to define the neural network and the new JupyterLab to write the code (I think it’s really cool!). If you like, you can find my data science environment, with all of this dockerized, at this link.

OK, now let’s talk about the neural network used in this post. The most interesting layer is the LSTM layer; if you want to know more about LSTMs, I suggest reading this post on Christopher Olah’s blog. LSTM layers are widely used for language processing, which is why I used this kind of layer for my analysis. A schema of the very simple neural network for this example is the following:

Network

The entire notebook used for this analysis is just below and can be found on my GitHub profile here. Every code block is commented, so I don’t want to bore you with a lot of words; let the code talk…

Conclusion

I trained this network in a few minutes with my dockerized data science environment on my laptop, without any kind of GPU.

As we can see from the graphs “Training and validation loss” and “Training and validation accuracy”, the 3rd epoch is the best one before the network starts to overfit the data.
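Reading the best epoch off the validation curve can also be done in code. The sketch below is only an illustration: the loss values are made up (they are not the numbers from the notebook) and `val_loss` is a hypothetical name for the per-epoch validation loss.

```python
# Hypothetical per-epoch validation losses (illustrative values only,
# NOT the numbers produced by the notebook).
val_loss = [0.42, 0.31, 0.27, 0.30, 0.36]

# The best epoch is the one with the lowest validation loss.
# Epochs are counted from 1, list indexes from 0, hence the +1.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print(f"best epoch: {best_epoch}")
```

After that epoch the validation loss goes up again while the training loss keeps going down, which is the usual sign of overfitting. Keras also ships an `EarlyStopping` callback that can stop the training at that point automatically.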

With this network the prediction accuracy jumped from 86% to 94% compared to the previous Naive-Bayesian classifiers, with a very simple network and few epochs. The accuracy for the positive tweets increased too. Despite this gain, I think the accuracy can be improved further, and this is the goal of my next tests.

Please feel free to comment and contact me to discuss this post! 🙂

 

 


Consider introducing Docker to your data science workflow


One of the big issues for a data scientist is configuring the data science environment correctly. Sometimes this means installing a lot of packages, waiting for packages to compile, handling obscure errors and setting everything up to work correctly, and most of the time it is a pain. But configuring the environment correctly is necessary to reproduce an analysis, or when we need to share work with others.

For these reasons I introduced Docker into my data science workflow.

What is Docker?

Docker is a tool that simplifies the installation process for software engineers. To explain it in a very simple way (sorry, Docker gurus, for this definition): Docker creates something like a super lightweight virtual machine that starts almost instantly and contains everything we need to run our environment in the right way.

If you want to read more, this is the official Docker website.

The goal of this post is to create an environment to run a very simple Jupyter notebook.

First of all we need to install Docker for the correct platform; refer to this tutorial for the steps.

Now we can start to create our environment. Actually, we could just pull a ready-to-use image for this; on Docker Hub there are a lot of ready-to-use images, for example:

  • dataquestio/python3-starter
  • dataquestio/python2-starter

but my goal is to create my own environment from scratch!

Let’s open our favorite text editor and start to create the Dockerfile. A Dockerfile is a file that describes how the image will be built:

  1. Start with a simple python3 image that is based on Debian.
  2. Update all packages to the latest version.
  3. Copy the requirements.txt that describes all the Python packages we need for our data science environment.
  4. Run the installation of all packages.
  5. Expose the port for Jupyter.
  6. Run the command to start the Jupyter notebook.

Now it’s time to write the requirements.txt. This file describes all the Python packages we need and will be used by pip to install them correctly.

OK, we are ready to build our image; the command is:

docker build -t your_container_name .

With the -t option we can tag our image, for example:

docker build -t datascience_env .

When the build process is finished we can run our container:

docker run -p 8887:8888 -v /path_your_machine/notebook_folder/:/Documents -it datascience_env

With the -v option, /path_your_machine/notebook_folder/ will be mounted into the Docker container at the /Documents path. The -p 8887:8888 option publishes the container’s port 8888 on port 8887 of the host.

This is useful to save the work and to keep the environment separate from the notebooks. I prefer to organize my work this way, instead of creating a Docker container that contains both the environment and the notebooks.

OK, when the container is up, we can open the Jupyter web interface:

http://127.0.0.1:8887

and when the token is asked for we enter ‘mynotebook’, or whatever you set in your Dockerfile, and that’s all! Now we can work in our new data science environment.

Click on Documents and we have all our notebooks! Note: since that folder is mounted from the host, every change is still there after the container is stopped.

To test this environment I used the DBSCAN example found on the sklearn website. This is the link.

When our work is finished, we can stop the container with the command below (docker stop expects the name or ID of the running container, which docker ps shows):

docker stop datascience_env

I think Docker is a very important tool for every developer and every data scientist to deploy and share work. From my point of view, the most important innovation Docker introduces is a way to describe (with a Dockerfile) how to correctly recreate an environment where my code can run. In this way I can reproduce, every time, the exact environment I used during my development process, and I can share the built image with everyone.

 

Sentiment analysis on US Twitter Airline dataset – 2 of 2

Tre cime di Lavaredo – Dolomiti – Italy

The results of the analysis in the last post were computed on the dataset as it is. But now my goal is to have these statistics updated at every new tweet, or every hour. So, first of all, it’s necessary to train a classifier that is able to classify new tweets as positive or negative.

The classifier chosen is a Naive-Bayesian classifier. It is one of the simplest supervised machine learning methods, and its results are acceptable.

The steps used to train the classifier are the following:

  • Use NLTK with punkt
  • Define a set of useless words (NLTK stopwords and string punctuation) so that the tweets are tokenized correctly
  • Tokenize the negative and the positive tweets and define the features
  • Build a Naive-Bayesian sentiment classifier with a training set
  • Test the classifier and find the accuracy

As I wrote in the previous post, NLTK is a very powerful library, so it’s easy to create the classifier. The steps, translated into code, are the following:
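A dependency-free sketch of these steps, on a handful of made-up tweets, could look like this. It is plain Python rather than NLTK, so everything in it is an assumption for illustration: the tiny stop-word list, the helper names `tokenize`, `train` and `classify`, and the classifier itself, which is a small word-count Naive Bayes with Laplace smoothing and not NLTK's own implementation.

```python
import math
import string
from collections import Counter, defaultdict

# Tiny hand-written stop-word list (assumption); the post uses NLTK's stopwords.
STOPWORDS = {"the", "a", "is", "to", "my", "was", "and", "for"}


def tokenize(tweet):
    """Lower-case the tweet, strip punctuation and drop the stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = tweet.lower().translate(table).split()
    return [w for w in words if w not in STOPWORDS]


def train(labelled_tweets):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_tweets:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(model, tweet):
    """Return the label with the highest log-probability for the tweet."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)  # prior
        for w in tokenize(tweet):
            # Laplace smoothing: unseen words get a small, non-zero probability.
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up tweets, only to show the flow (same number per class).
train_set = [
    ("Great flight, friendly crew!", "positive"),
    ("Thanks for the smooth trip", "positive"),
    ("My flight was cancelled again", "negative"),
    ("Lost my luggage, terrible service", "negative"),
]
test_set = [
    ("friendly crew and great trip", "positive"),
    ("terrible, flight cancelled", "negative"),
]

model = train(train_set)
correct = sum(classify(model, text) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy: {accuracy:.2f}")
```

With NLTK the training and the accuracy check are handled by the library; the sketch only spells out what happens under the hood.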

As we can see, the accuracy of the classifier is about 86%; this result depends on the small number of tweets used for training. I used only 2000 tweets because the positive tweets in this dataset are only about 2400. There are many more negative tweets, but I can’t use all of them: it is necessary to train the classifier with the same number of positive and negative tweets so as not to introduce a bias.
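Balancing the two classes amounts to drawing the same number of tweets from each. A minimal sketch, assuming 2000 tweets per class and placeholder lists instead of the real dataset (`balanced_sample` is a hypothetical helper, and the list sizes are only rough stand-ins):

```python
import random


def balanced_sample(positive, negative, n_per_class, seed=0):
    """Draw the same number of tweets from each class, without replacement."""
    rng = random.Random(seed)
    # Never ask for more tweets than the smaller class can provide.
    n = min(n_per_class, len(positive), len(negative))
    return rng.sample(positive, n), rng.sample(negative, n)


# Placeholder tweets standing in for the real dataset (sizes are assumptions).
positive = [f"positive tweet {i}" for i in range(2400)]
negative = [f"negative tweet {i}" for i in range(9000)]

pos_sample, neg_sample = balanced_sample(positive, negative, n_per_class=2000)
print(len(pos_sample), len(neg_sample))
```

Without this step a classifier trained on far more negative than positive tweets would tend to answer "negative" too often.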

The last step of this work is to use the negative tweets to train a classifier for the reason behind the negative tweets. The steps to define this classifier are the same as for the first one, but now the subset is only the negative tweets.

In this work I try to classify only the first two causes of negative tweets, because the others don’t have enough data. All the other tweets are classified as others.

For this classifier, the accuracy reached is 69%, not so good. But the number of tweets used is only 1000, plus 400 to verify the classifier.

Conclusion

The first classifier, with an accuracy of 86%, is acceptable because human accuracy is about 80%. But the accuracy of the second is not acceptable because it is too low.

If we consider the combined accuracy (classifying the tweets as negative or positive and then classifying the negative tweets by reason), it goes down to 59%.
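The 59% is consistent with simply multiplying the two accuracies, which holds under the assumption that the two classifiers make their mistakes independently:

```python
sentiment_accuracy = 0.86  # positive vs. negative classifier
reason_accuracy = 0.69     # reason classifier, applied to the negative tweets

# Assuming independent errors, a tweet is handled correctly end to end
# only when both stages are right, so the probabilities multiply.
combined = sentiment_accuracy * reason_accuracy

print(f"combined accuracy: {combined:.0%}")
```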

The main reason for these results is that the dataset has few positive tweets and few tweets for each cause of a bad flight. It cannot be excluded that better results could be reached with a more complex classifier.

Sentiment analysis on US Twitter Airline dataset – 1 of 2


I want to explore some concepts of sentiment analysis and try some libraries that can help with data analysis and sentiment analysis. The dataset used is “Twitter US Airline Sentiment”, which can easily be found on Kaggle: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

This dataset is very nice: it contains tweets about US airlines from February 2015, classified as positive, negative or neutral. The negative tweets are also classified by negative reason.

This post is divided into two parts:

  • First part: data analysis on the dataset to find the best and the worst airlines and to understand the most common problems in case of a bad flight
  • Second part: training two Naive-Bayesian classifiers: the first to classify the tweets into positive and negative, and the second to classify the negative tweets by reason. In this way it is possible to update the statistics of the first part with new tweets.

This work can be useful for the airline companies to understand which problems to work on.

For this work I will use Python with Pandas, Jupyter notebook and NLTK.

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. I’ll use this library to load the dataset and do some analysis.

Jupyter notebook is very useful for data scientists because it is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. In this way sharing work is very easy.

NLTK is a platform for building programs that work with human language. With this library it is simple to train a classifier for computational linguistics.

Data analysis

OK, let’s start with the data analysis. Looking at the dataset, we can find a lot of columns, but the most important are:

  • airline
  • airline_sentiment
  • negativereason

This dataset doesn’t need any cleaning, but for the questions I want to answer some transformations are necessary. First of all, it is necessary to divide the tweets into three groups: positive, negative and neutral. Then I want to find the best and the worst airline in the month of February 2015. After that I want to find the most common problems that caused a bad flight; for this I consider only the negative tweets (obviously!).
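The transformations boil down to filtering by sentiment and counting. Here is a small sketch of the idea in plain Python, with a few made-up rows that only reuse the three column names listed above (the rows themselves are not taken from the dataset):

```python
from collections import Counter

# A few made-up rows; only the column names come from the dataset.
rows = [
    {"airline": "United", "airline_sentiment": "negative", "negativereason": "Late Flight"},
    {"airline": "United", "airline_sentiment": "negative", "negativereason": "Customer Service Issue"},
    {"airline": "Delta", "airline_sentiment": "negative", "negativereason": "Customer Service Issue"},
    {"airline": "Southwest", "airline_sentiment": "positive", "negativereason": ""},
    {"airline": "Southwest", "airline_sentiment": "positive", "negativereason": ""},
    {"airline": "Delta", "airline_sentiment": "positive", "negativereason": ""},
    {"airline": "United", "airline_sentiment": "neutral", "negativereason": ""},
]

# Split the tweets by sentiment.
negative = [r for r in rows if r["airline_sentiment"] == "negative"]
positive = [r for r in rows if r["airline_sentiment"] == "positive"]

# Count tweets per airline and per reason, most frequent first.
worst = Counter(r["airline"] for r in negative).most_common()
best = Counter(r["airline"] for r in positive).most_common()
reasons = Counter(r["negativereason"] for r in negative).most_common()

print(worst, best, reasons, sep="\n")
```

With Pandas the same counts come from filtering the DataFrame on `airline_sentiment` and calling `value_counts()` on the `airline` or `negativereason` column.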

The result for the worst airline (number of negative tweets) is the following:

Airline          Num of tweets
United           2633
US Airways       2263
American         1960
Southwest        1186
Delta            955
Virgin America   181

And for the best (number of positive tweets):

Airline          Num of tweets
Southwest        570
Delta            544
United           492
American         336
US Airways       269
Virgin America   152

From the dataset, the number of tweets for each bad flight problem is:

Problem                       Tweets_count
Bad Flight                    580
Can’t Tell                    1190
Cancelled Flight              847
Customer Service Issue        2910
Damaged Luggage               74
Flight Attendant Complaints   481
Flight Booking Problems       529
Late Flight                   1665
Lost Luggage                  724
longlines                     178
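The table is in alphabetical order, so the ranking is easier to see after sorting by count. The counts below are copied from the table; the variable names are just for this sketch.

```python
# Counts copied from the table above.
reason_counts = {
    "Bad Flight": 580,
    "Can't Tell": 1190,
    "Cancelled Flight": 847,
    "Customer Service Issue": 2910,
    "Damaged Luggage": 74,
    "Flight Attendant Complaints": 481,
    "Flight Booking Problems": 529,
    "Late Flight": 1665,
    "Lost Luggage": 724,
    "longlines": 178,
}

# Sort from the most to the least frequent problem.
ranked = sorted(reason_counts.items(), key=lambda item: item[1], reverse=True)

# Share of all negative tweets that falls on the top problem.
total = sum(reason_counts.values())
top_reason, top_count = ranked[0]
top_share = top_count / total

for reason, count in ranked:
    print(f"{reason:<30}{count:>5}")
print(f"{top_reason}: {top_share:.1%} of {total} negative tweets")
```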

Conclusion

So the most frequent reason for a bad flight is a customer service issue… not so good for the companies…

In the next post I’ll train the Naive-Bayesian classifiers to analyze new tweets.

Clustering for everyday life – 2 of 2


In my previous post, I wrote about clustering and the k-means algorithm. In this post, I want to use those concepts and TensorFlow to write a simple example (and help myself plan my next trip to Gotham City).

For this implementation we need Python (I use 3.7, but 2.x is OK) and some packages:

  • matplotlib
  • TensorFlow
  • numpy
  • pandas

Installing those packages is simple:

  • for Python 2.x:
    • pip install <package name>
  • for Python 3.x:
    • pip3 install <package name>

In any case, you can follow the installation instructions on the documentation of each package.

So far so good? OK, let’s dive into the code:

First of all, I defined the parameters of my problem:

  • number of points: 1000
  • number of clusters: 4
  • number of computational steps: 100

In this particular example, I used as training set a set of 1000 GPS positions generated randomly [lines 27 to 36] around the position 45.7, 9.1. If you have a file with real coordinates, you can load them and use those instead. Lines 34 to 36 display the training set:

Start
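Generating such a training set takes only a few lines. This is a sketch with the standard library, not the code from the post: the spread of 0.05 degrees and the choice of a Gaussian scatter are assumptions, while the number of points and the center come from the text.

```python
import random

NUM_POINTS = 1000                    # from the post
CENTER_LAT, CENTER_LON = 45.7, 9.1   # from the post
SPREAD_DEG = 0.05                    # assumed scatter, in degrees

rng = random.Random(42)  # fixed seed so the example is reproducible

# Each point is a (latitude, longitude) pair scattered around the center.
points = [
    (rng.gauss(CENTER_LAT, SPREAD_DEG), rng.gauss(CENTER_LON, SPREAD_DEG))
    for _ in range(NUM_POINTS)
]

print(len(points), points[0])
```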

In line 42 the vector values are converted into a constant usable by TensorFlow.

After randomly building the training set, we need the centroids [lines 44 to 50], converted into a variable that will be manipulated by TensorFlow. This is the key to the K-means algorithm: we need a set of centroids to start the iterations.

The cost function for K-means is the distance between each point and its centroid; the algorithm tries to minimize this cost function. As I wrote in my previous post, the distance between two GPS points can’t be calculated with the Euclidean distance, and it is necessary to introduce a more precise method to compute the distance; one such method is the spherical law of cosines. For this example, I used an approximation of the spherical law of cosines. This approximation works very well for city-scale distances and is computationally cheaper than the full formula. To know more about this approximation and its error, read this interesting post. [lines 53 to 63]
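To get a feeling for how close an approximation can be at city scale, the sketch below compares the spherical law of cosines with the equirectangular approximation. Choosing the equirectangular formula is an assumption: it is one common approximation, not necessarily the one from the linked post. The function names and the two test coordinates are hypothetical too.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def spherical_cosine_km(lat1, lon1, lat2, lon2):
    """Distance from the spherical law of cosines (inputs in degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    # Clamp to [-1, 1] to protect acos from rounding errors.
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def equirectangular_km(lat1, lon1, lat2, lon2):
    """Cheaper approximation: treat the small patch of the globe as flat."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    # Shrink the longitude difference by the cosine of the mean latitude.
    x = math.radians(lon2 - lon1) * math.cos((p1 + p2) / 2)
    y = p2 - p1
    return EARTH_RADIUS_KM * math.hypot(x, y)


# Two made-up points a few kilometres apart, near 45.7, 9.1.
exact = spherical_cosine_km(45.70, 9.10, 45.75, 9.20)
approx = equirectangular_km(45.70, 9.10, 45.75, 9.20)
print(f"law of cosines: {exact:.3f} km, approximation: {approx:.3f} km")
```

The approximation needs one cosine and one square root per pair of points instead of several trigonometric calls, which matters when every iteration compares all points with all centroids.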

And finally, the centroids are updated [line 65].

Lines 68 to 75 initialize all the variables, instantiate the evaluation graph, run the algorithm and visualize the results:

End

Conclusion

My last two posts focused on an algorithm for clustering problems: K-means. This algorithm makes some assumptions about the data:

  • the variance of the distribution of each attribute is spherical
  • all variables have the same variance
  • the prior probability of each cluster is the same

If any of those assumptions is violated, the algorithm can fail.

A possible drawback of this algorithm is the need to define, a priori, the number of clusters. If you don’t have any idea of what your clusters look like, you can choose another clustering algorithm like DBSCAN or OPTICS (those algorithms work on a density model instead of a centroid model). Or you can introduce a postprocessing step in K-means that merges (or splits) two or more centroids and then reruns the entire algorithm on the same training set but with the new set of centroids.

From the computational point of view, the K-means algorithm is linear in the number of data objects, while some other clustering algorithms have quadratic complexity. So this can be an important point to keep in mind.

Clustering for everyday life – 1 of 2


Let’s consider this scenario: I love walking, so when I visit a city I want to walk as much as possible, but I also want to optimize my time to see as many attractions as possible. Now I want to plan my next trip to Gotham City to visit some Batman places. I found 1000 places where Batman appeared and I have, at most, 4 days. I need to bucket those 1000 places into 4 buckets, so that the points in each bucket are close to a center where I can leave my car, to plan each day of my trip. How can I do this?

This kind of problem can be classified as a clustering problem. But what is clustering? Clustering, or cluster analysis, is the task of grouping a set of data into groups of homogeneous or similar items. The concept of homogeneous or similar has to be defined for each problem. So to solve this kind of problem it is necessary to:

  • Define the “resemblance” measure between elements (the concept of similarity)
  • Find the subsets of elements that are “similar” according to the chosen measure

The algorithm determines which elements form a cluster and what degree of similarity unites them within a cluster. Referring to my previous post, clustering is a problem that can be solved with algorithms that belong to the unsupervised methods, because the algorithm doesn’t have any information about the structure and characteristics of the clusters.

In particular, for this problem I’ll use the k-means algorithm: k-means is an algorithm that finds k groups (where k is given) in a dataset. Each group is described by a centroid that represents the “center” of the cluster. The concept of center always refers to the concept of distance that we have chosen for the specific problem.

For our problem, the concept of distance is simple, because it is the real distance between two points defined by a latitude and a longitude. For this reason, the Euclidean distance can’t be used, and it is necessary to introduce the spherical law of cosines to compute the correct distance between two geographical points.
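For reference, the spherical law of cosines gives the distance d between two points with latitudes φ1, φ2 and longitudes λ1, λ2 (in radians) on a sphere of radius R. The symbols are my notation, and for the Earth R is about 6371 km:

```latex
d = R \,\arccos\bigl(\sin\varphi_1 \sin\varphi_2
      + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_2 - \lambda_1)\bigr)
```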

But how does the k-means algorithm work? It follows an iterative procedure:

Flow graph for k-mean point
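The same loop can be sketched in a few lines of plain Python. To keep it short, this version uses the squared Euclidean distance on made-up 2-D points rather than the geographic distance needed for GPS coordinates, and `kmeans` is a hypothetical helper name.

```python
import random


def kmeans(points, k, steps=100, seed=0):
    """Plain k-means on 2-D points with squared Euclidean distance."""
    rng = random.Random(seed)
    # 1. Start from k centroids picked at random among the points.
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(steps):
        # 2. Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # 3. Stop as soon as no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # 4. Move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment


# Two made-up, well separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```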

The popularity of this algorithm comes from its:

  • convergence speed
  • ease of implementation

On the other hand, the algorithm doesn’t guarantee reaching the global optimum. The quality of the final solution strongly depends on the initial set of clusters. Since the algorithm is extremely fast, it’s possible to run it several times and choose the best solution.

This algorithm starts with a definition of k clusters, where k is defined by the user. But how does the user know if k is the correct number? And how do they know if the clusters are “good” clusters? One possible metric to measure the quality of the clusters is the SSE (sum of squared errors), where the error is the distance from the cluster centroid to the current point. Because this error is squared, more emphasis is placed on the points far from the centroid.
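A small sketch of the SSE, again with the Euclidean distance on made-up 2-D points (`sse` is a hypothetical helper, and both clusterings are written by hand just for the comparison):

```python
def sse(points, centroids, assignment):
    """Sum of squared distances between each point and its centroid."""
    return sum(
        (p[0] - centroids[a][0]) ** 2 + (p[1] - centroids[a][1]) ** 2
        for p, a in zip(points, assignment)
    )


# Four made-up points: two on the left, two on the right.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

# Sensible clustering: left pair together, right pair together.
good = sse(points, [(0, 1), (10, 1)], [0, 0, 1, 1])

# Poor clustering: bottom pair together, top pair together.
bad = sse(points, [(5, 0), (5, 2)], [0, 1, 0, 1])

print(good, bad)
```

The clustering with the lower SSE is the better one for a given k, which is also how the best of several runs can be picked.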

In the next post, I’ll show a possible way to solve this problem in TensorFlow.  

Continuous Integration, delivery, deploy


Continuous integration, continuous delivery and continuous deployment are terms used a lot in the software world. But I think that in most cases they are used in the wrong way, by me in the first place.

I have read the DZone guide to DevOps and I found very simple and clear definitions of these things:

Continuous Integration is a software development practice in which you build and test software every time a developer pushes code to the application.

Continuous Delivery is a software engineering approach in which continuous integration, automated testing, and automated deployment capabilities allow software to be developed and deployed rapidly, reliably, and repeatedly with minimal human intervention. Still, the deployment to production is defined strategically and triggered manually.

Continuous Deployment is a software development practice in which every code change goes through the entire pipeline and is put into production automatically, resulting in many production deployments every day. It does everything that Continuous Delivery does, but the process is fully automated; there’s no human intervention at all.

So, at first sight, these look like definitions of the same thing, but they aren’t. Continuous integration stops at the test stage; continuous delivery goes one step further and stops at a manual deployment to production. Continuous deployment takes the last step and automatically deploys the software to production.

I think it is important to keep these differences in mind in order to make the right choice in your software production chain.