Kolmogorov–Smirnov test, comparing distributions

Understanding and recognizing the distribution of data points is one of the important tasks in any machine learning problem. Many mathematical assumptions, feature engineering techniques, and algorithm choices depend on the nature of the distribution of the data points.

This inevitable study of distributions often leads to a situation where we need to compare 2 distributions. A simple example is when you need to check whether your train and test points come from the same distribution. Another question that arises in this context is how to “quantify” the difference between 2 distributions. It's not as easy as subtracting 2 scalars or vectors.

KS test

The KS test, or Kolmogorov–Smirnov test, is a method used to find out whether 2 distributions are similar or not. Concepts like hypothesis testing and the p-value are prerequisites for understanding this test, since the KS test is itself a hypothesis test.

In any hypothesis test there is a null hypothesis, which in this case is “The 2 samples come from the same distribution”. Before understanding when and how we retain or reject this hypothesis, let's look at a few steps below:

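  1. The problem statement is that we need to conclude whether 2 collections of data points {Xn, Ym} come from the same distribution (n and m are the number of points in each sample set).
  2. The first step of the KS test is to draw the empirical CDF of both sample sets.
  3. Now we need an expression or function that can quantify the difference between these 2 CDFs. The expression below, known as the K-S statistic, serves our purpose:

$$D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|$$

Here $F_{1,n}$ and $F_{2,m}$ are the empirical CDFs of the first and second sample, and the supremum picks out the largest vertical gap between them.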
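To make the steps concrete, here is a minimal sketch in plain Python that builds the two empirical CDFs and takes the largest absolute gap between them. The function names `ecdf` and `ks_statistic` and the toy samples are my own, chosen just for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(t) = fraction of points in `sample` that are <= t."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(t):
        # bisect_right counts how many sorted values are <= t
        return bisect_right(ordered, t) / n

    return F


def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    F_x, F_y = ecdf(x), ecdf(y)
    # Both ECDFs are step functions that only jump at observed points,
    # so it is enough to check the pooled sample values.
    return max(abs(F_x(t) - F_y(t)) for t in set(x) | set(y))


# Toy samples (made up for this example)
x = [0.1, 0.4, 0.7, 1.2]
y = [0.9, 1.5, 1.8, 2.3]
print(ks_statistic(x, y))  # 0.75, reached at t = 0.7 and again at t = 1.2
```

In practice you would probably reach for `scipy.stats.ks_2samp`, which returns this statistic together with a p-value; the sketch is only meant to show what is being computed.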

Once we know the K-S statistic and the null hypothesis, we come to the part that tells us when to reject the null hypothesis.

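For large samples, the null hypothesis is rejected at significance level alpha if:

$$D_{n,m} > c(\alpha)\sqrt{\frac{n+m}{nm}}, \qquad c(\alpha) = \sqrt{-\frac{1}{2}\ln\frac{\alpha}{2}}$$

Comparison with p value

From the above expression for $D_{n,m}$, one can easily get a relation of the form F < alpha (square both sides, rearrange, and take the exponential). Below you can see the result:

$$2\exp\left\{-2\,D_{n,m}^{2}\,\frac{nm}{n+m}\right\} < \alpha$$

Comparing the above expression with the condition (p-value < alpha) from hypothesis testing, we can say that the LHS plays the role of the p-value of the KS test. Strictly speaking it is a large-sample approximation. Notice that it depends on the number of points in both sample sets.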
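Here is a small sketch of that decision rule in Python. It assumes the two-sided, large-sample form 2·exp(−2·D²·nm/(n+m)) for the approximate p-value. The function names and the example numbers are mine, and the approximation can be quite rough for small samples.

```python
import math


def ks_pvalue_approx(d, n, m):
    """Approximate two-sided p-value: 2 * exp(-2 * d^2 * n*m / (n + m)).

    This is a large-sample approximation (an assumption of this sketch),
    capped at 1 because the raw expression can exceed 1 for small d.
    """
    return min(1.0, 2.0 * math.exp(-2.0 * d * d * n * m / (n + m)))


def reject_null(d, n, m, alpha=0.05):
    """Reject 'both samples come from the same distribution' if p < alpha."""
    return ks_pvalue_approx(d, n, m) < alpha


# Same gap-based statistic, very different evidence depending on sample size
print(reject_null(0.75, 4, 4))      # tiny samples: not enough evidence
print(reject_null(0.30, 100, 100))  # larger samples: a smaller gap is enough
```

The point to take away is that the same value of the statistic can lead to different conclusions depending on n and m, which is why the sample sizes appear in the expression.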

The Kolmogorov–Smirnov statistic in more than one dimension

One approach to generalizing the Kolmogorov–Smirnov statistic to higher dimensions is to compare the CDFs of the two samples under all possible orderings, and take the largest of the set of resulting K–S statistics. In d dimensions, there are $2^d - 1$ such orderings.

(Source: Wikipedia)

More to come!