Article on one of the most common interview questions: How do you interpret KNN outputs?
KNN has no explicit learning phase. It is a "lazy" algorithm: it simply finds the k nearest neighbours and performs the classification or regression task almost like a hard-coded instruction, so nothing "intelligent" seems to happen. While the idea behind the algorithm is simple and straightforward, it is this very simplicity that leads to many possible questions, because when we try to solve complex real-life problems with such a simple algorithm, many border cases must be considered.
Let's try to answer a few.
We know KNN can be used for classification as well as regression. We start with the problems faced during classification.
THE PROBLEM IN CLASSIFICATION
Suppose you have 4 classes and the number of nearest neighbours (k) you chose is 30. For a certain test point you got the following results:
- Class 1 = 10 neighbours
- Class 2 = 10 neighbours
- Class 3 = 6 neighbours
- Class 4 = 4 neighbours
Now what should our test point be classified as? First, let's consider what the neighbours from the 3rd and 4th classes tell us. Consider a medical case, say cancer detection, which consists of 2 output classes, C1 and C2.
You used KNN with k = 10 and got 6 and 4 points respectively as the neighbours. Just because you have more neighbours of class 1, can you rule out the possibility of the second cancer type being present?
The answer is no.
In many cases you need to look at the "probabilistic" results rather than just the final class picked by the larger number of neighbours. Considering the above case of 2 classes, here is how the results would differ:
- If we use simple majority-vote KNN, the output is: cancer of class 1.
- If we use probability scores, the output is: a 60% chance of cancer 1 and a 40% chance of cancer 2.
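As a minimal sketch of this difference, here is scikit-learn's `KNeighborsClassifier` on a tiny made-up 1-D dataset (the points and labels are purely illustrative); with k equal to the dataset size, every point is a neighbour, giving a 6-vs-4 vote:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic 2-class dataset: 6 points of class 1 and 4 of class 2.
# With k = 10 (the whole dataset) every point counts as a neighbour,
# so the vote is 6 vs 4, mirroring the cancer example above.
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6],   # class 1
              [0.7], [0.8], [0.9], [1.0]])                # class 2
y = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X, y)

test_point = np.array([[0.55]])
print(knn.predict(test_point))        # hard majority vote: class 1
print(knn.predict_proba(test_point))  # probability scores: 0.6 vs 0.4
```

`predict` throws away the 40% minority signal that `predict_proba` preserves, which is exactly the information a medical use case cannot afford to lose.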
Now let's get back to the 4-class problem. There, of course, the maximum neighbours come from class 1 and class 2. Even using probability scores, if we need to give one final output as a solution to a business problem, how can we break the tie? In such cases a lot depends on the domain of the problem, but let's discuss 2 generic ways to break such a tie.
USING WEIGHTED KNN
Between class 1 and class 2 we can select the final output by using weighted KNN. In weighted KNN we assign more weight to points that are closer to the test point. So instead of just counting neighbours, we can apply an inverse-distance weighting when computing each class's score.
INCREASE K BY A CERTAIN VALUE AND RECHECK
We used k = 30 in the case above. To break the tie, we can rerun with k = 32 or 34; recounting the neighbours with the larger k will usually remove the tie.
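This grow-k-until-the-tie-breaks loop can be sketched in plain Python. The function name and the `k_step`/`max_k` knobs are hypothetical, not from any library; the input is simply the training labels sorted by distance to the test point:

```python
from collections import Counter

def knn_vote_with_tie_break(sorted_labels, k, k_step=2, max_k=None):
    """Vote among the k nearest labels; on a tie, grow k until it breaks.

    sorted_labels: training labels ordered nearest-first relative to the
    test point. k_step and max_k are illustrative knobs, not standard.
    """
    max_k = max_k or len(sorted_labels)
    while k <= max_k:
        counts = Counter(sorted_labels[:k]).most_common()
        # A unique winner exists when the top count beats the runner-up.
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        k += k_step
    return counts[0][0]  # fall back to the last vote if the tie persists

# Tie at k=4 (2 'A' vs 2 'B'); growing to k=6 breaks it in favour of 'A'.
labels = ['A', 'B', 'B', 'A', 'A', 'A']
print(knn_vote_with_tie_break(labels, k=4))  # -> 'A'
```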
THE PROBLEM IN REGRESSION
In KNN, regression refers to returning either the average or the median of a continuous target value associated with the nearest neighbours. The problem here is not related to probability; rather, it is related to the presence of outliers.
An outlier can skew the average badly, whereas the median is far more robust to such issues.
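A tiny numeric illustration (the neighbour target values are made up): a single outlier among the 5 nearest neighbours drags the mean far from the bulk of the data, while the median stays put:

```python
import numpy as np

# Target values of the 5 nearest neighbours; 500.0 is an outlier.
neighbour_values = np.array([10.0, 12.0, 11.0, 9.0, 500.0])

mean_pred = neighbour_values.mean()        # (10+12+11+9+500)/5 = 108.4
median_pred = np.median(neighbour_values)  # middle of 9,10,11,12,500 = 11.0

print(mean_pred)    # 108.4 -- dragged up by the outlier
print(median_pred)  # 11.0  -- robust to it
```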
So, in summary, KNN works well only when one class clearly dominates the others among the k neighbours. And remember, because there is no such thing as "training" in KNN, one can do little except change the value of k (or the weighting scheme) if the neighbours turn out to be distributed roughly evenly across the classes.
This article focused on one of the many problems one can face during interviews. Other topics and their solutions, like k-means++, k-d trees and more, will be discussed in subsequent posts.