Sentiment analysis on US Twitter Airline dataset – 1 of 2


I want to explore some concepts of sentiment analysis and try some libraries that can help with data analysis and sentiment analysis. The dataset used is “Twitter US Airline Sentiment”, which can be easily found on Kaggle.

This dataset is very nice: it contains tweets about US airlines from February 2015, classified as positive, negative or neutral. The negative tweets are also classified by the reason for the complaint.
This post is divided into two parts:

  • First part: data analysis on the dataset to find the best and the worst airlines and to understand the most common problems behind a bad flight
  • Second part: training two Naive Bayes classifiers: the first to classify tweets as positive or negative, and the second to classify the negative tweets by reason. In this way it is possible to update the statistics from the first part with new tweets.

This work can help the airline companies understand which problems to work on.

For this work I will use Python with Pandas, Jupyter Notebook and NLTK.

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. I’ll use this library to load the dataset and run some analysis.

Jupyter Notebook is very useful for data scientists because it is a web application that lets you create and share documents containing live code, equations, visualizations and explanatory text. This makes sharing work very easy.

NLTK is a platform for building programs that work with human language data. With this library it is simple to train a classifier for computational linguistics.

Data analysis

Ok, let’s start with the data analysis. Looking at the dataset, we can find a lot of columns, but the most important are:

  • airline
  • airline_sentiment
  • negativereason

This dataset doesn’t need any cleaning, but the questions I want to answer require some transformations. First of all, the tweets have to be divided into three groups: positive, negative and neutral. Then I want to find the best and the worst airline in February 2015. After that, I want to find the most common problems behind a bad flight. For this I consider only the negative tweets (obviously!).
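The grouping described above can be sketched with the standard library alone. This is only an illustration, not the notebook code: the column names come from the list above, while the sample rows and the helper names `count_by_airline` and `count_reasons` are made up for the example.

```python
import csv
import io
from collections import Counter

# Made-up sample rows that only mimic the three columns named above.
# With the real Kaggle file you would open the CSV from disk instead.
SAMPLE_CSV = """airline,airline_sentiment,negativereason
United,negative,Late Flight
United,negative,Lost Luggage
Delta,positive,
Delta,negative,Customer Service Issue
Southwest,positive,
Southwest,neutral,
"""


def count_by_airline(rows, sentiment):
    """Count tweets per airline for one sentiment label."""
    return Counter(r["airline"] for r in rows if r["airline_sentiment"] == sentiment)


def count_reasons(rows):
    """Count negative tweets per negative reason."""
    return Counter(
        r["negativereason"] for r in rows if r["airline_sentiment"] == "negative"
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
negative = count_by_airline(rows, "negative")
positive = count_by_airline(rows, "positive")
reasons = count_reasons(rows)

print(negative.most_common())
print(positive.most_common())
print(reasons.most_common())
```

With Pandas the same counts would typically come from a group-by on the `airline` column; the plain version just makes the three steps explicit.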

The result for the worst airline (number of negative tweets) is the following:


Num of tweets

United 2633
US Airways 2263
American 1960
Southwest 1186
Delta 955
Virgin America 181

And for the best (number of positive tweets):


Num of tweets

Southwest 570
Delta 544
United 492
American 336
US Airways 269
Virgin America 152

From the dataset, the number of negative tweets for each bad flight problem is:



Bad Flight 580
Can’t Tell 1190
Cancelled Flight 847
Customer Service Issue 2910
Damaged Luggage 74
Flight Attendant Complaints 481
Flight Booking Problems 529
Late Flight 1665
Lost Luggage 724
longlines 178


So the most common reason for a negative tweet is customer service… not so good for the companies…
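Since the list above is in alphabetical order, a small sketch can sort it and add the share of each reason. The counts are copied from the list; the function name `rank_reasons` is just for this example.

```python
# Counts copied from the list above (negative tweets per reason).
REASON_COUNTS = {
    "Bad Flight": 580,
    "Can't Tell": 1190,
    "Cancelled Flight": 847,
    "Customer Service Issue": 2910,
    "Damaged Luggage": 74,
    "Flight Attendant Complaints": 481,
    "Flight Booking Problems": 529,
    "Late Flight": 1665,
    "Lost Luggage": 724,
    "longlines": 178,
}


def rank_reasons(counts):
    """Return (reason, count, share of total) sorted by count, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(reason, n, n / total) for reason, n in ranked]


ranking = rank_reasons(REASON_COUNTS)
for reason, n, share in ranking:
    print(f"{reason:30s} {n:5d} {share:6.1%}")
```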

In the next post I’ll train the Naive Bayes classifiers to analyze new tweets.


Clustering for everyday life – 2 of 2


In my previous post, I wrote about clustering and the k-means algorithm. In this post, I want to use those concepts and TensorFlow to write a simple example. (And help myself to plan my next trip to Gotham City).

For this implementation we need Python (I use 3.7, but 2.x is ok too) and some packages:

  • matplotlib
  • TensorFlow
  • numpy
  • pandas

Installing those packages is simple:

  • for Python 2.x:
    • pip install <package name>
  • for Python 3.x:
    • pip3 install <package name>

In any case, you can follow the installation instructions in the documentation of each package.

So far so good? Ok, let’s dive into the code:

First of all, I defined the parameters of my problem:

  • number of points: 1000
  • number of clusters: 4
  • number of computational steps: 100

In this particular example, I used as a training set 1000 GPS positions generated randomly [lines 27 to 36] around the position 45.7, 9.1. If you have a file with real coordinates, you can load them and use those instead. Lines 34 to 36 display the training set:
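As a rough stand-in for the point generation described above (not the original TensorFlow code), here is a standard-library sketch. The spread `SPREAD_DEG`, the Gaussian noise and the function name `make_points` are assumptions made for illustration; only the number of points and the center come from the post.

```python
import random

NUM_POINTS = 1000      # number of points, as in the parameters above
CENTER = (45.7, 9.1)   # latitude, longitude used in the post
SPREAD_DEG = 0.05      # assumed spread in degrees; not from the post


def make_points(n, center, spread, seed=0):
    """Return n (lat, lon) pairs scattered around center with Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    lat0, lon0 = center
    return [(rng.gauss(lat0, spread), rng.gauss(lon0, spread)) for _ in range(n)]


points = make_points(NUM_POINTS, CENTER, SPREAD_DEG)
print(len(points), points[0])
```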


In line 42 the vector values are converted into a constant usable by TensorFlow.

After randomly building the training set, we need the centroids [lines 44 to 50], converted into a variable that will be manipulated by TensorFlow. This is the key to the K-means algorithm: we need a set of initial centroids to start the iterations.

The cost function for K-means is the distance between each point and its centroid; the algorithm tries to minimize this cost function. As I wrote in my previous post, the distance between two GPS points can’t be calculated with the Euclidean distance, so it is necessary to introduce a more precise method to compute the distance. One such method is the spherical law of cosines. For this example, I used an approximation of the spherical law of cosines. This approximation works very well for city-scale distances and is more computationally efficient than the full formula. To know more about this approximation and its error, read this interesting post. [lines 53 to 63]
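To make the comparison concrete, here is a small sketch of the spherical law of cosines next to one common flat-map shortcut, the equirectangular approximation. Which approximation the original code uses is not stated here, so treat the second function as an assumption; the function names and the two test coordinates are also just for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def cosine_law_km(p, q):
    """Great-circle distance between (lat, lon) points via the spherical law of cosines."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def equirectangular_km(p, q):
    """Flat-map approximation: cheap, and reasonable for city-scale distances."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)  # east-west, shrunk by latitude
    y = lat2 - lat1                                  # north-south
    return EARTH_RADIUS_KM * math.hypot(x, y)


a = (45.70, 9.10)
b = (45.75, 9.18)
print(cosine_law_km(a, b), equirectangular_km(a, b))
```

The approximation needs one cosine and a square root per pair, against several trigonometric calls and an arccosine for the full formula, which is where the saving comes from when the distance is evaluated for every point and centroid at every step.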

And finally, the centroids are updated [line 65].
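The whole loop (assign every point to its nearest centroid, then move each centroid) can be sketched in plain Python. This is not the TensorFlow implementation from the post: the distance is the equirectangular approximation (an assumption), the centroid is a simple average of latitudes and longitudes, and the data is generated with an assumed spread of 0.05 degrees.

```python
import math
import random

NUM_POINTS = 1000    # parameters listed earlier in the post
NUM_CLUSTERS = 4
NUM_STEPS = 100


def approx_distance(p, q):
    """Equirectangular distance (in radians) between two (lat, lon) points.

    Assumed stand-in for the post's distance; multiply by the Earth's
    radius to get a length.
    """
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    return math.hypot((lon2 - lon1) * math.cos((lat1 + lat2) / 2), lat2 - lat1)


def kmeans(points, k, steps, seed=0):
    """Plain K-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random data points
    assignment = []
    for _ in range(steps):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: approx_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(lat for lat, _ in members) / len(members),
                    sum(lon for _, lon in members) / len(members),
                )
    return centroids, assignment


# Assumed test data: Gaussian scatter around the position used in the post.
rng = random.Random(42)
points = [(rng.gauss(45.7, 0.05), rng.gauss(9.1, 0.05)) for _ in range(NUM_POINTS)]

centroids, assignment = kmeans(points, NUM_CLUSTERS, NUM_STEPS)
for c, (lat, lon) in enumerate(centroids):
    print(c, round(lat, 4), round(lon, 4), assignment.count(c))
```

Averaging raw latitudes and longitudes is only acceptable because the points sit within a few kilometers of each other; over large areas the centroid would need a proper spherical mean.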

Lines 68 to 75 initialize all the variables, instantiate the evaluation graph, run the algorithm and visualize the results:



My last two posts focused on an algorithm for clustering problems: K-means. This algorithm makes some assumptions about the data:

  • the variance of the distribution of each attribute (variable) is spherical
  • all variables have the same variance
  • the prior probability of each cluster is the same

If one of those assumptions is violated, the algorithm can fail to find good clusters.

A possible drawback of this algorithm is the need to define, a priori, the number of clusters. If you have no idea what your clusters look like, you can choose another clustering algorithm like DBSCAN or OPTICS (those algorithms work on a density model instead of a centroid model). Or you can introduce a postprocessing step in K-means that merges (or splits) two or more centroids and then relaunches the entire algorithm on the same training set with the new set of centroids.

From the computational point of view, the K-means algorithm is linear in the number of data objects (for a fixed number of clusters and iterations), while some other clustering algorithms have quadratic complexity. So this can be an important point to keep in mind.

Relational vs NoSQL Database


What is a NoSQL database? And how is it different from a relational database? These are some frequent questions when somebody starts to study NoSQL databases.

First of all: what is a relational database? A relational database is a collection of data items organized as a set of tables; the tables describe the relationships between the stored data. The language used to manage this kind of database is SQL (Structured Query Language).

NoSQL databases approach data management without tables, using other data models. These kinds of databases are very useful for very large and distributed data sets. This database family seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address.
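To make the difference in data models tangible, here is a toy comparison in Python. The relational side uses the standard `sqlite3` module; the "document" side is just a nested dictionary standing in for what a document store would keep, not a real NoSQL client. Table names, fields and values are invented for the example.

```python
import json
import sqlite3

# Relational model: fixed columns, rows linked by keys, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "item TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, "book"), (2, 1, "pen")],
)
rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Document model: the same information kept as one nested record,
# so reading a customer's orders needs no join.
document = {"name": "Ada", "orders": [{"item": "book"}, {"item": "pen"}]}
items_from_document = [order["item"] for order in document["orders"]]

print(rows)
print(json.dumps(document))
```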

Recently, enterprise solutions have focused on managing large amounts of heterogeneous data, so it’s clear why these types of databases have grown.

There are a lot of NoSQL databases; the most famous are:

  • MongoDB
  • Elastic
  • Neo4j
  • OrientDB

Each one is designed for a specific goal.

In this series of posts I want to write about the de facto standard graph database: Neo4j.

What is a graph database? A graph database management system (graph database) is a NoSQL database management system with Create, Read, Update and Delete (CRUD) methods that expose a graph data model.
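A toy in-memory sketch can show what "CRUD methods that expose a graph data model" means in practice. This is not Neo4j and not its API: the class `TinyGraph`, its method names and the sample data are all invented for illustration.

```python
class TinyGraph:
    """Toy property graph: nodes with properties, typed directed edges."""

    def __init__(self):
        self.nodes = {}     # node id -> dict of properties
        self.edges = set()  # (source id, relationship type, target id)

    # Create
    def create_node(self, node_id, **props):
        self.nodes[node_id] = dict(props)

    def create_edge(self, src, rel, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist")
        self.edges.add((src, rel, dst))

    # Read
    def read_neighbors(self, src, rel):
        return sorted(d for s, r, d in self.edges if s == src and r == rel)

    # Update
    def update_node(self, node_id, **props):
        self.nodes[node_id].update(props)

    # Delete (also removes every edge touching the node)
    def delete_node(self, node_id):
        del self.nodes[node_id]
        self.edges = {e for e in self.edges if node_id not in (e[0], e[2])}


g = TinyGraph()
g.create_node("alice", city="Milan")
g.create_node("bob", city="Rome")
g.create_node("carol", city="Turin")
g.create_edge("alice", "KNOWS", "bob")
g.create_edge("alice", "KNOWS", "carol")
print(g.read_neighbors("alice", "KNOWS"))

g.update_node("bob", city="Naples")
g.delete_node("carol")
print(g.read_neighbors("alice", "KNOWS"), g.nodes["bob"])
```

The point of the model is that relationships are stored as first-class data, so "who does alice know?" is a direct lookup rather than a join.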

In the next posts we’ll go deeper into the graph DB world, and we’ll use Neo4j for some tests.

Kotlin exercise 1: Connect to MongoDB -part 1-


After the first posts on Kotlin, I want to create something more difficult, so I chose to try a database connection in Kotlin.

In this exercise I chose to connect my little application to a Mongo database. This choice is driven by a book that I started reading recently, “MongoDB in Action”, because I want to know more about this type of database. So I decided to try both new things together: Kotlin and MongoDB.

This exercise is organized in this way: the first part is a Java test of the connection and some operations with MongoDB; then the same operations are written in Kotlin.

Ok, let’s start with the code:

First of all, create the instance of the MongoDB client and create the DB:

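import com.mongodb.*;

import java.net.UnknownHostException;
import java.util.Date;

/**
 * Created by Claudio on 30/04/16.
 */
public class Main {

    public static void main(String[] args) {

        try {
            // Connect to a MongoDB server on the default local port
            MongoClient mongo = new MongoClient("localhost", 27017);

            // Get the database; MongoDB creates it on the first write if it is missing
            DB db = mongo.getDB("testDB");

            DBCollection table;
            table = db.getCollection("clienti");

            // Insert document
            BasicDBObject document = new BasicDBObject();
            document.put("name", "mkyong");
            document.put("age", 30);
            document.put("createdDate", new Date());
            // Assumed step: the listing was cut here; store the document in the collection
            table.insert(document);

            /**** Find and display ****/
            BasicDBObject searchQuery = new BasicDBObject();
            searchQuery.put("name", "mkyong");

            DBCursor cursor = table.find(searchQuery);

            while (cursor.hasNext()) {
                // Assumed step: print each matching document to standard output
                System.out.println(cursor.next());
            }
        } catch (UnknownHostException e) {
            // Declared by the MongoClient constructor in the legacy 2.x Java driver
            e.printStackTrace();
        } catch (MongoException e) {
            e.printStackTrace();
        }
    }
}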

This is a little example that shows how to:

  • connect to MongoDB (line 14)
  • create or get the database (lines 16-19)
  • create a document (lines 22-26)
  • get the inserted data (lines 29-32)
  • print the data to standard output (lines 34-35)

The result of running this code is:

[Screenshot 2016-05-04 at 21.09.20]

The next part is focused on the Kotlin equivalent code.


How to set up Kotlin in an Android project

In this post I want to show how to start an Android project with some Kotlin classes:

Start Android Studio and install the Kotlin plugin: go to “Search Everywhere” and type “plugin”. In Plugins, select Kotlin and install it.

[Screenshot 2016-03-30 at 18.11.37]

[Screenshot 2016-03-30 at 18.13.02]

After this, create a new Android project and then, in “Search Everywhere”, type “Configure Kotlin” and select “Configure Kotlin in Project”. This operation modifies your build.gradle and adds some lines:

[Screenshot 2016-03-30 at 18.13.43]

[Screenshot 2016-03-30 at 18.19.18]

After this operation it’s possible to convert Java code into Kotlin code, so an activity written in Java can be converted into an activity written in Kotlin: select the activity to convert and then select “Convert Java File to Kotlin File” from the Code menu.

[Screenshot 2016-03-30 at 18.20.43]


Now we can start writing our Kotlin code in this Android project.

Kotlin – first steps –

What is Kotlin?

From the JetBrains blog:

Kotlin is a pragmatic programming language for JVM and Android that combines OO and functional features and is focused on interoperability, safety, clarity and tooling support.

Being a general-purpose language, Kotlin works everywhere where Java works: server-side applications, mobile applications (Android), desktop applications. It works with all major tools and services…

I want to try this new programming language for two reasons:

  • I want to study something new
  • I was born as a C++ developer and, for me, Kotlin is very simple to read and understand, so I can use this language wherever Java is used.

The number of Kotlin projects on GitHub has increased exponentially in the last few years, so this is another good reason to start learning this programming language.



Well, after this introduction about “why” I want to learn Kotlin, let’s start with something interesting for the developer.

Like every first step into a new programming language, I started from “HelloWorld”:

fun main(args: Array<String>) {
    println("Hello World")
}

These few lines of code show how to create a function (fun), how to declare the parameters accepted by the function and how to write something to the standard output.

For me it is very simple and intuitive.

I want to go deeper into the Kotlin world!