Clustering for everyday life — 2 of 2-

In my previous post, I wrote about clustering and the k-means algorithm. In this post, I want to use those concepts and TensorFlow to write a simple example (and to help myself plan my next trip to Gotham City).

For this implementation we need Python (I use 3.7, but 2.x is fine too) and some packages:

  • matplotlib
  • TensorFlow
  • numpy
  • pandas

Installing those packages is simple:

  • for Python 2.x:
    • pip install <package name>
  • for Python 3.x:
    • pip3 install <package name>

In any case, you can follow the installation instructions in the documentation of each package.

So far so good? OK, let’s dive into the code:

First of all, I defined the parameters of my problem:

  • number of points: 1000
  • number of clusters: 4
  • number of computational steps: 100

In this particular example, I used as the training set 1000 GPS positions generated randomly [lines 27 to 36] around the position 45.7, 9.1. If you have a file with real coordinates, you can load them and use those instead. Lines 34 to 36 display the training set:

[Figure: Start]
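
As a rough sketch of the generation step described above (not the original script), the training set could be built with NumPy along these lines. The names `NUM_POINTS` and `SPREAD_DEG`, the fixed seed and the 0.05-degree spread are assumptions made for illustration; the post only states the number of points and the center position.

```python
import numpy as np

NUM_POINTS = 1000                   # number of points, as stated in the post
CENTER_LAT, CENTER_LON = 45.7, 9.1  # center position, as stated in the post
SPREAD_DEG = 0.05                   # assumed half-width of the area, in degrees

# Fixed seed so that the sketch is reproducible (an assumption, not in the post)
rng = np.random.default_rng(seed=42)

# Each row is one (latitude, longitude) pair scattered around the center
points = np.column_stack([
    CENTER_LAT + rng.uniform(-SPREAD_DEG, SPREAD_DEG, NUM_POINTS),
    CENTER_LON + rng.uniform(-SPREAD_DEG, SPREAD_DEG, NUM_POINTS),
])
```

If you have real coordinates in a file, you would replace `points` with an array of the same shape, one row per position.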

In line 42 the vector values are converted into a constant usable by TensorFlow.

After randomly building the training set, we need the centroids [lines 44 to 50], which are converted into a variable that TensorFlow will manipulate. This is the key to the K-means algorithm: we need a set of centroids to start the iterations.

The cost function for K-means is the distance between each point and its centroid, and the algorithm tries to minimize it. As I wrote in my previous post, the distance between two GPS points can’t be calculated with the Euclidean distance, so a more precise method is needed; one such method is the spherical law of cosines. For this example, I used an approximation of the spherical law of cosines. This approximation works very well at city-scale distances and is computationally cheaper than implementing the full formula. To learn more about this approximation and its error, read this interesting post. [lines 53 to 63]
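
The post does not say which approximation was used, so the following is only an illustration of the idea. It uses the equirectangular approximation, one common cheap substitute for the full spherical formula at short distances. The function name `approx_distance_km` and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def approx_distance_km(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance between two GPS points.

    Inputs are in decimal degrees. The approximation treats the small patch
    of the sphere between the two points as flat, which is reasonable at
    city scale and degrades over long distances.
    """
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Shrink the longitude difference by the cosine of the mean latitude
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)
```

For example, `approx_distance_km(45.7, 9.1, 45.8, 9.1)` gives roughly 11.1 km, which is the length of a tenth of a degree of latitude.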

And finally, the centroids are updated [line 65].

Lines 68 to 75 initialize all the variables, instantiate the evaluation graph, run the algorithm and visualize the results:

[Figure: End]

Conclusion:

My last two posts focused on an algorithm for clustering problems: K-means. This algorithm makes some assumptions about the data:

  • the variance of the distribution of each attribute is spherical
  • all variables have the same variance
  • the prior probability of each cluster is the same

If any of those assumptions is violated, the algorithm fails.

A possible drawback of this algorithm is that the number of clusters must be defined a priori. If you have no idea what your clusters look like, you can choose another clustering algorithm such as DBSCAN or OPTICS (those algorithms work on a density model instead of a centroid model). Alternatively, you can add a postprocessing step to K-means that merges (or splits) two or more centroids and then reruns the entire algorithm on the same training set with the new set of centroids.

From the computational point of view, the K-means algorithm is linear in the number of data objects, while some other clustering algorithms have quadratic complexity. This can be an important point to keep in mind.

Clustering for everyday life — 1 of 2-

Let’s consider this scenario: I love walking, so when I visit a city I want to walk as much as possible, but I also want to optimize my time to see as many attractions as possible. Now I want to plan my next trip to Gotham City to visit some Batman places. I found 1000 places where Batman appeared and I have, at most, 4 days. I need to group those 1000 places into 4 buckets, so that the points in each bucket are close to a center where I can leave my car, and plan each day of my trip around one bucket. How can I do this?

This kind of problem can be classified as a clustering problem. But what is clustering? Clustering, or cluster analysis, is the task of grouping a set of data into sets of homogeneous or similar items. What “homogeneous” or “similar” means has to be defined for each problem. So to solve this kind of problem it is necessary to:

  • Define the “resemblance” measure between elements (concept of similarity)
  • Find the subsets of elements that are “similar” according to the chosen measure

The algorithm determines which elements form a cluster and what degree of similarity unites them within a cluster. As mentioned in my previous post, clustering is a problem that can be solved with algorithms that belong to the unsupervised methods, because the algorithm has no prior information about the structure and characteristics of the clusters.

In particular, for this problem I’ll use the k-means algorithm: k-means is an algorithm that finds k groups (where k is chosen in advance) in a given dataset. Each group is described by a centroid that represents the “center” of the cluster. The concept of center always depends on the concept of distance chosen for the specific problem.

For our problem, the concept of distance is simple: it is the real distance between two points defined by a latitude and a longitude. For this reason, the Euclidean distance can’t be used, and it is necessary to introduce the spherical law of cosines to compute the correct distance between two geographical points.

But how does the k-means algorithm work? It follows an iterative procedure:
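
To make the formula concrete, here is a minimal sketch of the spherical law of cosines in plain Python. The function name `cosine_law_distance_km` and the mean Earth radius of 6371 km are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def cosine_law_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance from the spherical law of cosines.

    Inputs are in decimal degrees, the result is in kilometers.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Rounding can push the value slightly outside [-1, 1], which would make
    # acos raise an error, so clamp it first
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_KM * math.acos(cos_angle)
```

With this radius, a quarter of the equator, `cosine_law_distance_km(0, 0, 0, 90)`, comes out at about 10,007.5 km.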

[Figure: flow graph for the k-means algorithm]

The popularity of this algorithm comes from its:
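
The iterative procedure can be sketched in a few lines of NumPy: pick k starting centroids, assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is only an illustration under simplifying assumptions. The function name `kmeans`, its parameters and the random initialization are hypothetical. It also uses the plain Euclidean distance for brevity, so for GPS coordinates the distance computation would need to be replaced with a geographic one.

```python
import numpy as np


def kmeans(points, k, num_steps=100, seed=0):
    """Minimal k-means sketch. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct points, chosen at random, as initial centroids
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    labels = np.zeros(len(points), dtype=int)

    for _ in range(num_steps):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its previous position)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop as soon as the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels
```

Calling `kmeans(points, 4)` on an array with one row per place returns the four centers and, for each place, the index of the bucket it belongs to.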

  • convergence speed
  • ease of implementation

On the other hand, the algorithm doesn’t guarantee reaching the global optimum. The quality of the final solution strongly depends on the initial set of clusters. Since the algorithm is extremely fast, it’s possible to run it several times and choose the best solution.

This algorithm starts with the definition of k clusters, where k is defined by the user. But how does the user know whether k is the correct number? And how do they know whether the clusters are “good” clusters? One possible metric for the quality of the clusters is the SSE (sum of squared errors), where the error is the distance from the cluster centroid to the current point. Because this error is squared, more emphasis is placed on the points far from the centroid.
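
As a small illustration, the SSE can be computed directly from the points, the centroids and the cluster index of each point. The function name `sum_of_squared_errors` is hypothetical, and the sketch uses the Euclidean distance as the error, so it would need a geographic distance for GPS data.

```python
import numpy as np


def sum_of_squared_errors(points, centroids, labels):
    """Sum of the squared distances between each point and its own centroid."""
    # centroids[labels] lines up each point with the centroid of its cluster
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# Every point is at distance 1 from its centroid, so the SSE is 4.0
print(sum_of_squared_errors(points, centroids, labels))
```

A lower SSE for the same k means tighter clusters, so this value can be used to compare the results of several runs.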

In the next post, I’ll show a possible way to solve this problem in TensorFlow.  

Continuous Integration, delivery, deploy

Continuous integration, continuous delivery and continuous deployment are terms used a lot in the software world. But I think that, in most cases, they are used in the wrong way, by me first of all.

I read DZone’s guide to DevOps and found very simple and clear definitions of these terms:

Continuous Integration is a software development practice in which you build and test software every time a developer pushes code to the application.

Continuous Delivery is a software engineering approach in which continuous integration, automated testing, and automated deployment capabilities allow software to be developed and deployed rapidly, reliably, and repeatedly with minimal human intervention. Still, the deployment to production is defined strategically and triggered manually.

Continuous Deployment is a software development practice in which every code change goes through the entire pipeline and is put into production automatically, resulting in many production deployments every day. It does everything that Continuous Delivery does, but the process is fully automated; there’s no human intervention at all.

At first sight, these look like definitions of the same thing, but they aren’t. Continuous integration stops at the test stage; continuous delivery goes one step further and stops at a manually triggered deployment to production. Continuous deployment takes the last step and deploys software to production automatically.

I think it is important to keep these differences in mind in order to make the right choice for your software production chain.

Relational vs NoSQL Database

What is a NoSQL database? And how is it different from a relational database? These are some frequent questions when somebody starts to study NoSQL databases.

First of all: what is a relational database? A relational database is a collection of data items organized as a set of tables; the tables describe the relationships between the stored data. The language used to manage this kind of database is SQL (Structured Query Language).

NoSQL databases approach data management without tables, using other data models instead. These kinds of databases are very useful for very large and distributed data sets. This database family seeks to solve the scalability and big data performance issues that relational databases weren’t designed to address.

Lately, enterprise solutions have focused on managing large amounts of heterogeneous data, so it’s clear why these types of databases have grown.

There are a lot of NoSQL databases; the most famous are:

  • MongoDB
  • Elastic
  • Neo4j
  • OrientDB

Each one is designed for a specific goal.

In this series of posts I want to write about the de facto standard graph database: Neo4j.

What is a graph database? A graph database management system (graph database) is a NoSQL database management system with Create, Read, Update and Delete (CRUD) methods that expose a graph data model.

In the next posts we’ll go deeper into graph db world, and we’ll use Neo4j for some tests.

How to test a Kotlin class?

TDD is a modern software development process based on writing automated tests before writing the code, so programming can hardly be called modern without testing the software. In this post I want to write something about testing software written in Kotlin.

OK, after this little introduction let’s start to write some code: in the previous post on Kotlin the problem was to connect to MongoDB and write some data. Now I want to complicate this example: the challenge is to write a class (a data class) and then save it into the database.

In Kotlin, writing a data class is very simple (you can find more here); my data class for this example is:

data class Person(val name: String, val surname: String, val age: Int)

Testing this class is very simple: we can test it the same way as in Java, with JUnit!

In IntelliJ, a test can be created directly from the class:

[Screenshot 2016-05-22 at 21.41.20]

[Screenshot 2016-05-22 at 21.44.54]

We can select which methods to test and the IDE creates the skeleton for the test class:

[Screenshot 2016-05-22 at 21.48.29]

I saved this test class in a test folder to keep everything organized. It’s important to configure the test folder for the project:

[Screenshot 2016-05-22 at 21.55.26]

[Screenshot 2016-05-22 at 21.55.44]

Now everything is configured and I can test my class! Run the test class and the result is something like this:

[Screenshot 2016-05-22 at 21.58.28]

All green! Great!

 

Kotlin exercise 1: Connect to MongoDB -part 2-

In the first part of this exercise we saw how to connect to MongoDB with Java. Now that it is clear how the connection can be performed, we can try to do the same with Kotlin.
OK, let’s waste no more time and start coding:


import com.mongodb.BasicDBObject
import com.mongodb.MongoClient
import com.mongodb.MongoException
import java.net.UnknownHostException
import java.util.Date

/**
 * Created by Claudio on 01/05/16.
 * This is a main to test mongoDB connection in kotlin
 */

// Data class used by the reflection example below
data class Person(val name: String, val surname: String, val age: Int)

fun main(args: Array<String>) {
    try {
        // Connect to the MongoDB server running on localhost
        val mongo = MongoClient("localhost", 27017)

        // Get the database and the collection to work on
        val db = mongo.getDB("testDB")
        val table = db.getCollection("person")

        // Reflection example: get the Java class of a Kotlin object
        val person = Person("Jon", "Doe", 20)
        val data = person.javaClass

        // Insert document
        val document = BasicDBObject()
        document.put("name", "mkyong")
        document.put("age", 30)
        document.put("createdDate", Date())
        table.insert(document)

        // Find and display
        val searchQuery = BasicDBObject()
        searchQuery.put("name", "mkyong")

        val cursor = table.find(searchQuery)

        while (cursor.hasNext()) {
            println(cursor.next())
        }
    } catch (e: UnknownHostException) {
        e.printStackTrace()
    } catch (e: MongoException) {
        e.printStackTrace()
    }
}
  

Sorry, but the Kotlin code isn’t highlighted.

As we can see, the code is similar to the Java example and the output is the same:

[Screenshot 2016-05-04 at 21.10.01]

This code can be converted from Java code with the IntelliJ IDE function called “Convert java code to kotlin code”. It can be reached from the search tool or by copying and pasting Java code into a Kotlin file:

[Screenshot]

As an exercise, the “Main.kt” code was written from scratch.

In this little exercise I learned how to connect to MongoDB and how to code something more difficult than “hello world” in Kotlin.

A little step for mankind, a big step for a developer!

Kotlin exercise 1: Connect to MongoDB -part 1-

After the first posts on Kotlin, I want to create something more difficult, and I chose to try a database connection in Kotlin.

In this exercise I chose to connect my little application to a Mongo database. This choice is driven by a book that I started reading recently, “MongoDB in Action”, because I want to know more about this type of database. So I decided to try both new things together: Kotlin and MongoDB.

This exercise is organized in this way: the first part is a Java test of the connection and operations with MongoDB; then the same operations are written in Kotlin.

Ok, let’s start with code:

First of all, create the instance of the MongoDB client and create the DB:

import com.mongodb.*;

import java.net.UnknownHostException;
import java.util.Date;

/**
 * Created by Claudio on 30/04/16.
 */
public class Main {

    public static void main(String[] args) {

        try {
            MongoClient mongo = new MongoClient( "localhost" , 27017 );

            DB db = mongo.getDB("testDB");

            DBCollection table;
            table = db.getCollection("clienti");

            //Insert document
            BasicDBObject document = new BasicDBObject();
            document.put("name", "mkyong");
            document.put("age", 30);
            document.put("createdDate", new Date());
            table.insert(document);

            /**** Find and display ****/
            BasicDBObject searchQuery = new BasicDBObject();
            searchQuery.put("name", "mkyong");

            DBCursor cursor = table.find(searchQuery);

            while (cursor.hasNext()) {
                System.out.println(cursor.next());
            }

        } catch (UnknownHostException e) {
            e.printStackTrace();
        } catch (MongoException e) {
            e.printStackTrace();
        }
    }

}

This is a little example that shows how to:

  • connect to MongoDB (line 14)
  • create or get the database (lines 16-19)
  • create a document (lines 22-26)
  • get the inserted data (lines 29-32)
  • print the data to standard output (lines 34-35)

The result of running this code is:

[Screenshot 2016-05-04 at 21.09.20]

The next part is focused on the Kotlin equivalent code.