photo-1614771797949-e486f4223276

KNN and Probability (interview questions)

Article on one of the most common interview question: How do you interpret KNN outputs?

KNN does not have a learning phase . Its a lazy algorithm that just finds the “k” nearest neighbours and performs the classification/regression task almost like a hard coded instruction and nothing “intelligent” seems to happen. While the idea behind the algorithm is pretty simple and straightforward ; it is this simplicity that leads to many possible questions , because when one tries to solve complex real life questions using such simple algorithms , many border cases must be considered.

lets try to answer a few .

We know KNN can be used to solve classification as well as regression. We start with problems faced during classification .

THE PROBLEM IN CLASSIFICATION

SUPPOSE YOU HAVE 4 CLASSES AND THE NUMBER OF NEAREST NEIGHBOURS (K) YOU CHOSE IS 30. FOR A CERTAIN POINT YOU GOT THE FOLLOWING RESULTS :

  1. CLASS 1 =10 NEIGHBOURS
  2. CLASS 2 =10 NEIGHBOURS
  3. CLASS 3= 6 NEIGHBOURS
  4. CLASS 4= 4 NEIGHBOURS

NOW WHAT SHOULD OUR TEST POINT BE CLASSIFIED AS ? FIRST LETS CONSIDER WHAT THE NEIGHBOURS FROM THE 3RD AND 4TH CLASS TELL US. CONSIDER A MEDICAL CASE , LET SAY CANCER DETECTION , WHICH CONSISTS OF 2 OUTPUT CLASSES , C1 AND C2 .

YOU USED KNN where k=10 AND GOT 6 AND 4 POINTS RESPECTIVELY AS THE NEIGHBOURS . JUST BECAUSE YOU HAVE MORE NEIGHBOURS OF CLASS 1 CAN YOU RULE OUT THE POSSIBILITY OF THE SECOND CANCER CELL BEING PRESENT ?

THE ANSWER IS NO.

In many cases you require to look at the “probabilistic ” results rather than the final selection using more number of neighbours . Considering the above case of 2 classes , following is how the results would differ .

  1. if we use simple majority vote KNN : output : cancer of class 1
  2. if we use probability scores : output : 60 % percent chance of cancer 1 and 40% chance of cancer 2 .

Now lets get back to the 4 class problem , of course there the maximum neighbours are from class 1 and class 2 . Even using probability scores if we need to give one final output as a solution to a business problem , how can we break the tie . In such cases a lot depends on the domain of the problem , but lets discuss 2 generic ways that we can use to break such a tie.

USING WEIGHTED KNN

FROM CLASS 1 AND CLASS 2 WE CAN SELECT THE FINAL OUTPUT BY USING WEIGHTED KNN .In Weighted KNN we assign more weightage to points that are closer to the test point . So instead of just counting neighbours we can assign “an inverse distance relation” while calculating distance scores .

ref: https://slideplayer.com/slide/4808190/

INCREASE K BY A CERTAIN VALUE AND RECHECK

We used k=30 in the case provided. To beat the tie lets consider we use K=32 or 34 , now calculating the number of neighbours will remove the tie .

THE PROBLEM IN REGRESSION

In KNN ,regression refers to returning either the average /Median of a certain continuous value associated with the nearest neighbours. The problem here is not related to probability , rather it is related to the presence of outliers .

An outlier can mess up the average score but median score would be more robust to such issues .

So in summary a KNN (where k=n) works well only if there is always one such class which dominates the rest n-1 classes in terms of majority . And remember because there is no such thing as ” training” in KNN hence one can do nothing except changing k values if you find the neighbours distributed randomly in n classes.

This article focuses on one of the many problems that one can face during interviews. Other problems and their solutions like kmeans++, kd-trees and more would be discussed in subsequent posts.