Here we discuss the failure cases of the K-means algorithm .
As we know , k-means is one of the most commonly asked algorithm in interviews its advisable to know its failure cases along with its working principle.
K means is a very simple and straight forward algorithm , and hence its not surprising to hear that it will can run into complications in real life scenarios . lets get to the cases were we get absurd results from this algorithm .
DIFFERENT CLUSTER SIZES
In k means , the first step is to initialize the centroids , (which are done randomly in kmeans and probabilistically in kmeans++)
now look at this image, consisting of 3 clusters, the left side shows the original ,( although in us-supervised learning there are no “original tags” or “labels” but here we consider pre labelled data in order to have an understanding that how k means can deviate from real life truth)
you can see how because k means tends to make clusters of equal sizes , it is likely to perform weakly on differently sized clusters.
DIFFERENT DENSITY CLUSTERS
In real life data sets its very common to have class imbalances, k means fails to perform efficiently if the different clusters are of different densities. look at the image below.
notice that due to difference in densities, the clusters formed are very different from the original points.
NON GLOBULAR DISTRIBUTIONS
K means algorithm is all about updating centroids based on the mean distances and the approach never considers that whether the clusters can have an intrinsic pattern ,as you saw before k means tend to create clusters that are of same sizes around centroids and hence a more “symmetric” approach around each centroid is there. You can see how terribly the k means algorithm fails in the below cluster shape:
PRESENCE OF OUTLIERS
Outliers are a common problem with algorithms that deal with mean/average of data points . Outliers lead to problems in K means too . But how are outliers handled in data sets which are not labelled ?
There are various methods that are used to handle outliers, like RANSAC , LOF(Local outlier factor) and more .
DIFFERENT RESULTS EVERYTIME
Since the very first step, that is initialisation of centroids is done randomly in k means , the obtained clusters are always different based on the choice of centroids. Hence running the algorithm multiple times will lead to different outcomes .
Although one thing is assured! K means will always converge. Hence it can be mathematically proven that there exists a certain iteration i , such that centroids will change negligibly at the (i+1)th iteration .
more to come!