Clustering for everyday life — 2 of 2


In my previous post, I wrote about clustering and the k-means algorithm. In this post, I want to use those concepts and TensorFlow to write a simple example (and to help myself plan my next trip to Gotham City).

For this implementation we need Python (I use 3.7, but 2.x is fine too) and some packages:

  • matplotlib
  • TensorFlow
  • numpy
  • pandas

Installing these packages is simple:

  • for Python 2.x:
    • pip install <package name>
  • for Python 3.x:
    • pip3 install <package name>

In any case, you can follow the installation instructions in the documentation of each package.

So far so good? OK, let's dive into the code.

First of all, I defined the parameters of my problem (see the sketch after this list):

  • number of points: 1000
  • number of clusters: 4
  • number of computational steps: 100
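In code, those parameters could be declared like this (a minimal sketch; the variable names are my own, not necessarily those of the original script):

```python
# Parameters of the problem (values from the post)
num_points = 1000    # size of the training set
num_clusters = 4     # number of clusters to find
num_steps = 100      # number of computational steps
```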

In this particular example, I used as training set a set of 1000 GPS positions generated randomly [lines 27 to 36] around the position 45.7, 9.1. If you have a file with real coordinates, you can load and use those instead. Lines 34 to 36 display the training set:

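A minimal sketch of this step, assuming a normal distribution with a 0.1-degree spread around the center (the post only gives the center point):

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate num_points random GPS positions around (45.7, 9.1);
# the 0.1-degree spread is an assumption of this sketch.
points = np.column_stack((
    np.random.normal(45.7, 0.1, num_points),  # latitudes
    np.random.normal(9.1, 0.1, num_points),   # longitudes
))

# Display the training set
plt.plot(points[:, 1], points[:, 0], 'o', markersize=2)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()
```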

In line 42, the vector values are converted into constants usable by TensorFlow:
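Something like this, using the TensorFlow 1.x API that was current when this post was written:

```python
import tensorflow as tf  # TensorFlow 1.x API

# Wrap the numpy training set in a TensorFlow constant
vectors = tf.constant(points)
```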

After randomly building the training set, we need the centroids [lines 44 to 50], converted into variables that will be manipulated by TensorFlow. This is the key of the K-means algorithm: we need a set of centroids to start the iterations.
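A common way to initialize them is to pick random points from the training set; here is a sketch (this selection strategy is my assumption):

```python
# Pick num_clusters random training points as initial centroids and
# store them in a tf.Variable, so TensorFlow can update them each step.
initial_idx = np.random.choice(num_points, num_clusters, replace=False)
centroids = tf.Variable(points[initial_idx])
```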

The cost function for K-means is the distance between each point and its centroid; the algorithm tries to minimize this cost function. As I wrote in my previous post, the distance between two GPS points can't be calculated with the Euclidean distance, and it is necessary to introduce a more precise method. One of these methods is the spherical law of cosines. For this example, I used an approximation of the spherical law of cosines: it works very well at city-scale distances and is more computationally efficient than the full formula. To know more about this approximation and its error, read this interesting post. [lines 53 to 63]
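Here is a sketch of the distance computation and cluster assignment, using the equirectangular approximation, a common small-distance shortcut (whether the original gist used exactly this formula is my assumption):

```python
# Broadcast points against centroids:
# expanded_vectors   has shape (1, num_points, 2)
# expanded_centroids has shape (num_clusters, 1, 2)
expanded_vectors = tf.expand_dims(vectors, 0)
expanded_centroids = tf.expand_dims(centroids, 1)

deg2rad = np.pi / 180.0
earth_radius_km = 6371.0

# Equirectangular approximation: treat the surface as locally flat,
# scaling the longitude difference by the cosine of the latitude.
dlat = (expanded_vectors[:, :, 0] - expanded_centroids[:, :, 0]) * deg2rad
dlon = (expanded_vectors[:, :, 1] - expanded_centroids[:, :, 1]) * deg2rad
mean_lat = deg2rad * (expanded_vectors[:, :, 0]
                      + expanded_centroids[:, :, 0]) / 2.0
distances = earth_radius_km * tf.sqrt(
    tf.square(dlat) + tf.square(tf.cos(mean_lat) * dlon))

# Assign every point to its nearest centroid
assignments = tf.argmin(distances, 0)
```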

And finally, the centroids are updated [line 65]:
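A sketch of the update step: each centroid moves to the mean of the points currently assigned to it (the implementation details here are mine):

```python
# Recompute each centroid as the mean of its assigned points
means = tf.concat([
    tf.reduce_mean(
        tf.gather(vectors,
                  tf.reshape(tf.where(tf.equal(assignments, c)), [-1])),
        axis=0, keepdims=True)
    for c in range(num_clusters)
], axis=0)

update_centroids = tf.assign(centroids, means)
```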

Lines 68 to 75 initialize all the variables, instantiate the evaluation graph, run the algorithm, and visualize the results:

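A sketch of this final step, again with the TensorFlow 1.x session API (the plotting details are my assumptions):

```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        _, centroid_values, assignment_values = sess.run(
            [update_centroids, centroids, assignments])

# Visualize the clustered points and the final centroids
plt.scatter(points[:, 1], points[:, 0], c=assignment_values, s=4)
plt.plot(centroid_values[:, 1], centroid_values[:, 0],
         'kx', markersize=12)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()
```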

Conclusion:

My last two posts focused on an algorithm for clustering problems: K-means. This algorithm makes some assumptions about the data:

  • the variance of the distribution of each attribute is spherical
  • all variables have the same variance
  • the prior probability of each cluster is the same

If one of those assumptions is violated, the algorithm fails.

A possible con of this algorithm is the necessity to define, a priori, the number of clusters. If you have no idea of what your clusters look like, you can choose another clustering algorithm like DBSCAN or OPTICS (those algorithms work on a density model instead of a centroid model). Or you can introduce a postprocessing step in K-means that aggregates (or splits) two or more centroids and then relaunches the entire algorithm on the same training set, but with the new set of centroids.

From the computational point of view, the K-means algorithm is linear in the number of data objects, while other clustering algorithms have quadratic complexity, so this can be an important point to keep in mind.
