Let’s consider this scenario: I love walking, so when I visit a city I want to walk as much as possible, but I want to optimize my time to watch as much as possible attractions. Now I want to plan my next trip to Gotham city to visit some Batman’s places. I found 1000 places in where Batman appeared and I have, at most, 4 days. I need to bucket those 1000 places into for 4 buckets, so that points are close to a center in where I can leave my car, to plan each day of my trip. How can I do this?

This kind of problem can be classified as a clustering problem. But what is clustering? Clustering or cluster analysis is the task of grouping a set of data into a selection of homogeneous or similar items. The concept of homogeneous or similar is defined in such way. So to solve this kind of problems is necessary:

- Define the “resemblance” measure between elements (concept of similarity)
- Find out if the subset of elements that are “similar”, in according to the measure chosen

The algorithm determines which elements form a cluster and what degrees of similarity unites them within a cluster. Refers to my previous post, clustering is a problem that can be solved with algorithms that belong to unsupervised methods, because the algorithm doesn’t know any kind of information about structure and characteristics of the clusters.

In particular, for this problem I’ll use the k-means algorithm: k-means is an algorithm that finds k groups (where k is defined) on a given dataset. Each group is described by a centroid that represents the “center” of each cluster. The concept of center is always referred to the concept of distance that we have chosen for the specific problem.

For our problem, the concept of distance is simple, because is the real distance between two points defined by a latitude and a longitude. For this reason, can’t be used the euclidean distance but is necessary to introduce the spherical law of cosine to compute the correct distance from to geographical points.

But how k-means algorithm work? Its follow an iterative procedure:

The popularity of this algorithm come from its:

- convergence speed
- ease of implementation

On the other hand, the algorithm doesn’t guarantee to achieve of the global optimum. The quality of the final solution strongly depends on the initial set of clusters. Since the algorithm is extremely fast, it’s possible to apply it several times and chose the best solution.

This algorithm starts with a definition of k cluster, where k is defined by the user. But how does the user know if k is the correct number? And how he know if the clusters are “good” clusters? One possible metrics to measure the quality of the clusters is SSE (Sum of square error), where error is the distance from the cluster centroid to the current point. Because this error is squared, this places more emphasis on the points far from the centroid.

*In the next post, I’ll show a possible way to solve this problem in TensorFlow. *