The goal of this work is to build a pipeline that classifies tweets about US airlines and to show a possible dashboard for understanding customer satisfaction trends.
For this work, the pipeline is composed of:
- Kafka: to ingest tweets
- Python services: to classify the tweets with a neural network
- Elasticsearch: to store the results
- Kibana: to create the dashboard.
Kafka ingests the tweets into the pipeline, so the pipeline can handle a high volume of tweets. An ingested tweet is pulled from the topic by one or more Python services, which classify it with the neural network trained in my last post (here). The training results were good: prediction accuracy is about 94%, so this network could be production ready for the classification task.
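As a sketch of the ingestion side, a minimal producer with kafka-python could look like this (the topic name, broker address and the publish_tweet helper are my assumptions, not part of the original pipeline code):

```python
import json

def publish_tweet(producer, tweet_text):
    # Hypothetical helper: push one raw tweet onto the ingestion topic ("tweets").
    producer.send("tweets", value={"text": tweet_text})

def make_producer():
    # kafka-python is imported here so the helper above stays dependency-free.
    from kafka import KafkaProducer
    return KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        # Serialize messages as UTF-8 JSON so the consumer can json.loads() them.
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
```

Any serialization format works here; JSON keeps the consumer side trivial.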
When a tweet is classified, the Python service saves it into an Elasticsearch index. On top of this index, a Kibana dashboard allows the user to monitor the trend of the US airlines.
Save the neural network
To build this pipeline, a small refactor of the training routine written in the previous post is necessary, because the neural network needs to be saved.
The refactor is the following:
To save the neural network, it's necessary to declare a ModelCheckpoint and then pass it as a callback to model.fit. In this way, the neural network is saved only if the result is better than the last saved model. After this step, the best trained neural network is saved in an HDF5 file and can be loaded by the pipeline.
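A minimal sketch of that refactor, assuming a Keras model built as in the previous post (the file name and the monitored metric are my assumptions; recent Keras versions expect the .keras extension, while older ones used .hdf5):

```python
from keras.callbacks import ModelCheckpoint

# Save the model to disk only when the monitored metric improves,
# so the file always holds the best model seen so far.
checkpoint = ModelCheckpoint(
    "airline_tweets_model.keras",  # assumed file name
    monitor="val_accuracy",        # assumed metric; "val_loss" works too
    save_best_only=True,
    mode="max",
)

# The checkpoint is then passed as a callback to the training call, e.g.:
# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=10,
#           callbacks=[checkpoint])
```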
Load the neural network
The pipeline core is one or more Python services that load the neural network and perform the tweet classification. This service scales well because it is totally stateless: it takes one tweet from the Kafka topic, classifies it, and writes the result to Elasticsearch.
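Such a service loop can be sketched as follows, assuming kafka-python and the official elasticsearch client; classify() stands for a hypothetical wrapper around the loaded neural network, and the topic, index and host names are assumptions:

```python
import json

def build_document(tweet, positive_conf, negative_conf):
    # Shape one classified tweet as an Elasticsearch document,
    # matching the index fields defined later in the post.
    label = "positive" if positive_conf >= negative_conf else "negative"
    return {
        "tweet": tweet,
        "classification": label,
        "positive_classification_confidence": positive_conf,
        "negative_classification_confidence": negative_conf,
    }

def run_service(classify):
    # Dependencies are imported here so the pure helper above has none.
    from kafka import KafkaConsumer          # kafka-python
    from elasticsearch import Elasticsearch

    consumer = KafkaConsumer(
        "tweets",                            # assumed topic name
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    es = Elasticsearch("http://localhost:9200")  # assumed host

    # Stateless loop: take one tweet, classify it, write the result.
    for message in consumer:
        tweet = message.value["text"]
        positive_conf, negative_conf = classify(tweet)
        es.index(index="tweets",
                 document=build_document(tweet, positive_conf, negative_conf))
```

Because the service keeps no state, more instances can simply be added to the same Kafka consumer group to increase throughput.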
From the code point of view, the neural network can be loaded in a very simple way:
with load_model from keras.models it is possible to load the file saved during the training phase.
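A sketch, assuming the file name used by the training checkpoint:

```python
from keras.models import load_model

def load_classifier(path="airline_tweets_model.keras"):
    # Load the best model written out by the checkpoint during training.
    return load_model(path)

# In the service, at startup:
# model = load_classifier()
# probabilities = model.predict(padded_sequence)  # a preprocessed tweet
```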
An important step is defining the Elasticsearch index. To do this step in the right way, it's important to know the use case: the dashboard's goal is to show the ratio of positive to negative tweets, and to show the latest tweets and how they are classified.
So the fields defined are:
- tweet: a field of type text, to store the tweet and allow a possible full-text search.
- classification: defined as keyword.
- positive_classification_confidence: a float, to be used in the future.
- negative_classification_confidence: a float, to be used in the future.
Full index definition is the following:
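As a sketch, the mapping described above can be reconstructed like this (the index name and host are my assumptions):

```python
# Reconstruction of the index mapping from the four fields listed above.
tweet_index_mapping = {
    "mappings": {
        "properties": {
            "tweet": {"type": "text"},  # full-text searchable
            "classification": {"type": "keyword"},
            "positive_classification_confidence": {"type": "float"},
            "negative_classification_confidence": {"type": "float"},
        }
    }
}

# Creating the index with the official Python client (assumed names):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="tweets", body=tweet_index_mapping)
```

keyword (rather than text) for classification is what lets Kibana aggregate the positive/negative counts for the pie chart.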
The dashboard created is very simple:
a pie chart shows the total number of positive and negative tweets, and a data table shows the latest tweets and how the neural network classified them.
This use case is a toy use case: it is only an excuse to build a data analysis pipeline with a neural network and to play with some cool stuff like Elasticsearch, Kibana and Kafka.
The dashboard shows very little information: only the tweet classification. This is because the dataset used to train the neural network is very small (only one month of tweets).
But if the dataset were bigger, some interesting analysis could be done from the US airlines' point of view, for example:
- finding the main causes of a bad flight: this information is available in the dataset, but there isn't enough data to train a good network
- finding the most problematic airports
Please feel free to let me know what you think about this post and how this work could be improved.