K-Means clustering intro

K-Means clustering:

K-means clustering is one of the simplest unsupervised learning algorithms for solving clustering problems.

  • K = number of clusters

Let's say you have an unlabeled data set, as below. You want to group this data into clusters.

Now, the question is: how should you choose the optimum number of clusters? There are two possible ways to choose it.

  1. Elbow method 
  2. Purpose Based
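
The elbow method is usually described as running K-means for several values of K, recording the within-cluster sum of squares (WCSS) for each, and picking the K after which the improvement levels off. The sketch below illustrates that idea in plain Python. The toy data set, the helper names (`sq_dist`, `mean`, `kmeans`, `wcss`) and the fixed seed are assumptions made for this example, not part of the original walkthrough.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def kmeans(points, k, iters=50, seed=0):
    """Tiny K-means, used here only to illustrate the elbow method."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    clusters = []
    for _ in range(iters):
        # Group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster
        # (assumption: an empty cluster keeps its old centroid).
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    return sum(sq_dist(p, c)
               for c, cluster in zip(centroids, clusters)
               for p in cluster)

# Hypothetical toy data: three small groups of five points each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
centers = [(0, 0), (10, 0), (5, 10)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# WCSS for K = 1..5; look for the K where the curve stops dropping sharply.
wcss_by_k = {k: wcss(*kmeans(points, k)) for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

Plotting these values against K would show the "elbow". Because K-means depends on its random starting points, the curve is not guaranteed to fall smoothly at every step.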


Once we have a value for K, let's look at how the algorithm executes.

Initialization:

First, randomly initialize two points, called the cluster centroids (any points can be taken as C1, C2, ...). In the diagram, these two centroids are shown in two different colors.
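
As a rough sketch of this step, one simple option is to pick the initial centroids from the data points themselves. That choice, the function name `init_centroids` and the sample data are assumptions made for this example. Any random points could serve as starting centroids.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points at random as the initial centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)  # sampling without replacement

# Hypothetical unlabeled data set.
points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]

# Two centroids, C1 and C2, for K = 2.
c1, c2 = init_centroids(points, k=2, seed=42)
print(c1, c2)
```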

Cluster Assignment:

Each data point is then assigned to whichever cluster centroid, Red or Blue, is closest to it, so the data points are divided into two groups. At this stage, the cluster formations are not yet optimized.
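
The assignment step can be sketched as follows, assuming Euclidean distance. The function name `assign_clusters` and the sample coordinates are made up for illustration, with index 0 standing for the Red centroid and index 1 for the Blue one.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid.
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]  # index 0 = Red, index 1 = Blue

labels = assign_clusters(points, centroids)
print(labels)  # [0, 0, 1, 1]
```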

Move Centroid:

Next, take the two cluster centroids and iteratively reposition them. Take all the Blue points, compute their average, and move the Blue cluster centroid to this new location. Do the same for the Red cluster centroid. Let's see how repeating this optimizes the clusters and gives us better insight.
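
A minimal sketch of the move-centroid step is below. The name `update_centroids` and the sample data are assumptions for this example. Keeping the old centroid when a cluster has no points is also a choice made here, since the walkthrough does not say what to do in that case.

```python
def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j in range(len(centroids)):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            # Coordinate-wise average of all points in cluster j.
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(centroids[j])
    return new_centroids

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]                    # 0 = Red, 1 = Blue
old_centroids = [(0.0, 0.0), (5.0, 5.0)]

new_centroids = update_centroids(points, labels, old_centroids)
print(new_centroids)  # [(1.5, 2.0), (9.0, 9.0)]
```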

Optimization:

Repeat the above two steps (cluster assignment and move centroid) until the cluster centroids stop changing their positions. At that point, the K-means clustering algorithm is said to have converged.
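
Putting the steps together, the loop below repeats assignment and repositioning until the centroids stop moving. It is a sketch under a few assumptions: Euclidean distance, initial centroids sampled from the data, a `max_iters` safety cap, and a hypothetical six-point data set.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=None):
    """Repeat assignment and centroid moves until the centroids are static."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    labels = []
    for _ in range(max_iters):
        # Cluster assignment: index of the nearest centroid for each point.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move centroid: mean of each cluster's members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Convergence: no centroid changed position in this pass.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2, seed=0)
print(centroids)
print(labels)
```

Which group ends up with label 0 or 1 depends on the random starting centroids, so only the grouping itself is meaningful.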

Convergence:

Finally, the k-means clustering algorithm converges and divides the data points into two clusters clearly visible in Red and Blue.

The K-means algorithm starts by randomly choosing a centroid value for each cluster. After that, the algorithm iteratively performs three steps:

  1. Find the Euclidean distance between each data instance and the centroids of all clusters.
  2. Assign each data instance to the cluster whose centroid is nearest.
  3. Calculate the new centroid values as the mean of the coordinates of all the data instances in the corresponding cluster.
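
The three steps can also be written as formulas. The notation is introduced here only for illustration: x_i is a data instance with n coordinates, c_j is the centroid of cluster j, l_i is the cluster assigned to x_i, and S_j is the set of instances assigned to cluster j.

```latex
% Step 1: Euclidean distance between instance x_i and centroid c_j
d(x_i, c_j) = \sqrt{\sum_{m=1}^{n} \left( x_{im} - c_{jm} \right)^2}

% Step 2: assign x_i to the cluster with the nearest centroid
l_i = \arg\min_{j \in \{1, \dots, K\}} d(x_i, c_j)

% Step 3: new centroid = mean of the instances assigned to cluster j
c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i,
\qquad S_j = \{\, x_i : l_i = j \,\}
```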


                                                 
