In this article we discuss the key mathematical tools used in the formulation of SVMs, and point out the questions commonly asked in interviews.
SVMs are one of those algorithms people often use as a black box, but interviewers may probe the details of the underlying mathematics, and in general it is always better to know the math behind any algorithm.
THE PROBLEM STATEMENT
Classification: the goal of an SVM is to find a hyperplane in the d-dimensional vector space (where d is the dimensionality of the dataset) that separates the data with the "maximum margin", or in other words "optimally" (multiple hyperplanes can separate the points; we want the best of them).
QUESTION: WHAT IS A HARD MARGIN SVM?
In a hard margin SVM we have zero tolerance for error: it tries to find the optimal hyperplane assuming every point will be correctly classified in the end.
SO WHAT ARE WE TRYING TO OPTIMIZE IN SVMs?
Let's consider a point whose actual class is y (−1 or +1), and let the predicted score for that point be (w·x + b). Note that four cases are possible:
- the product y·(w·x + b) is positive because y is +1 and the signed distance from the plane is also positive.
- the product y·(w·x + b) is positive because y is −1 and the signed distance from the plane is also negative.
- the product y·(w·x + b) is negative because y is +1 and the signed distance from the plane is negative.
- the product y·(w·x + b) is negative because y is −1 and the signed distance from the plane is positive.
You can see that every wrong classification makes this product negative, and every correct classification makes it positive. So we can start by saying that the optimal hyperplane should make this product positive for all points. From now on we divide by ||w|| (i.e. use w/||w||) to make the quantity scale-invariant.
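The sign check above can be sketched in a few lines; the hyperplane parameters and points below are made-up illustrative values:

```python
import numpy as np

# Hypothetical hyperplane parameters and labelled points (illustrative only).
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0],   # lies on the positive side of the plane
              [0.0, 2.0]])  # lies on the negative side of the plane
y = np.array([1, -1])       # true labels

# Signed score of each point (proportional to its distance from the plane).
scores = X @ w + b           # [2.5, -1.5]
products = y * scores        # [2.5, 1.5]

# Both products are positive, so both points are correctly classified.
print(products)
print(np.all(products > 0))
```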
Now a slightly tricky statement: "the optimal hyperplane is the one whose minimum value of the above product (which occurs at the closest point) is largest among all hyperplanes". In layman's terms, the distance to the closest point (which is called the margin) should be maximized. So our optimisation problem is:

max(w,b) M subject to y_i (w·x_i + b) / ||w|| ≥ M for all points i

where M is the margin (the distance from the closest point). The hyperplane with the maximum M will be selected.
Here y_i (w·x_i + b) / ||w|| is the product function discussed above. Now we tweak this expression. Remember from your class 12th 3-D geometry classes that scaling w and b does not change the final optimal hyperplane, so we can rescale w and b to make the minimum product equal to 1 (but why? so that our optimization problem reduces to a single quantity, ||w||!). We are then left with:

max(w,b) 1/||w|| subject to y_i (w·x_i + b) ≥ 1 for all i,

which is the same as min(w,b) ||w||. Now, minimizing ||w|| and minimizing (1/2)||w||² are the same problem (the square makes the further mathematics easier). Hence our optimization problem boils down to:

min(w,b) (1/2)||w||² subject to y_i (w·x_i + b) ≥ 1 for all i.
Let's solve this hard margin classifier.
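As a quick sanity check, the hard margin classifier can be approximated with scikit-learn by making C very large (a huge C is an approximation of the true hard margin problem, not the problem itself; the toy data below is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up toy data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# A very large C leaves essentially no tolerance for slack,
# approximating the hard-margin classifier.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

# On separable data the hard-margin solution classifies every point correctly.
print(clf.predict(X))
```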
QUESTION: HOW ARE CONSTRAINED OPTIMISATION PROBLEMS DIFFERENT FROM SIMPLE OPTIMISATION PROBLEMS?
Constrained optimisation problems are solved using the method of Lagrange multipliers: each constraint is absorbed into the objective through a multiplier α_i ≥ 0, giving the Lagrangian

L(w, b, α) = (1/2)||w||² − Σ_i α_i [ y_i (w·x_i + b) − 1 ].
THE DUAL FORM
Setting the derivatives of the Lagrangian with respect to w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0. Substituting these back into the Lagrangian above, you will have the dual problem:

max(α) Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
See how this problem now depends only on the Lagrange multipliers. When you use libraries in Python, the problem above is solved with something known as SMO (Sequential Minimal Optimization); libraries like libsvm use this algorithm.
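For instance, after fitting scikit-learn's libsvm-backed `SVC`, the learned multipliers are exposed through `dual_coef_`, which stores y_i·α_i for the support vectors only; a sketch on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable toy data.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # SVC is backed by libsvm, which uses SMO
clf.fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only;
# alpha is zero for every other training point.
print(clf.support_)     # indices of the support vectors
print(clf.dual_coef_)

# For a linear kernel, w = sum_i alpha_i y_i x_i over the support vectors,
# which is exactly what scikit-learn reports as coef_.
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)
```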
We know that in real life we seldom have completely linearly separable data points, so we need a variation that tolerates a little error. Of course this means some mathematical changes in our optimization problem. This change will also give us control over how much error is tolerated, and hence will act as a regularization parameter for our objective function.
Imagine you have a strict dad who says you must score 100 in your maths paper (hard margin). Somehow you make that puppy face, hit his soft corner, and he agrees that anything above 95 is good. Basically what you did was add a tolerance of 5; in the same way, in our soft-margin classifier we introduce a slack variable ζ_i for every point:
- But theoretically we could always choose ζ values large enough to make any solution feasible. To prevent that, we penalize large ζ values.
- We must also ensure that all ζ values are non-negative.
- We add one more parameter, C (the regularization parameter), to control how heavily the ζ values are penalized.
- Keeping all the above points in mind, we arrive at the following expression:

min(w,b,ζ) (1/2)||w||² + C Σ_i ζ_i subject to y_i (w·x_i + b) ≥ 1 − ζ_i and ζ_i ≥ 0 for all i.
And the corresponding dual form after applying Lagrange multipliers:

max(α) Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.
The parameter C controls how much slack is tolerated: the amount of slack allowed is inversely proportional to C, and as C → +∞ we recover the hard margin classifier. Remember this; it will help in interviews.
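The effect of C can be seen empirically; a sketch using scikit-learn's `SVC` on synthetic blobs (the dataset parameters are arbitrary):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping clusters (arbitrary synthetic data).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Small C tolerates slack generously (strong regularization);
# large C penalizes slack heavily, approaching the hard margin.
soft = SVC(kernel="linear", C=0.01).fit(X, y)
hard = SVC(kernel="linear", C=100.0).fit(X, y)

# A larger C typically leaves fewer points inside the margin,
# i.e. fewer support vectors.
print(len(soft.support_), len(hard.support_))
```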
SOLVING NON-LINEAR PROBLEMS with SVMs
The kernel trick lets us implicitly project the points into a higher dimension, so points that were seemingly inseparable by a hyperplane in the lower dimension become separable in the higher one.
Mathematically, in the dual form we define a kernel function K which computes the dot product x_i·x_j in the higher-dimensional space without ever constructing that space explicitly. Hence our dual objective becomes:

max(α) Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), with the same constraints as before.
There are different varieties of kernel functions available:
- the linear kernel: K(x_i, x_j) = x_i·x_j
- the polynomial kernel: K(x_i, x_j) = (γ x_i·x_j + r)^d
- the RBF kernel: K(x_i, x_j) = exp(−γ ||x_i − x_j||²)
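The difference between these kernels shows up clearly on data that is not linearly separable; a sketch using scikit-learn's concentric-circles dataset (the dataset parameters are arbitrary):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Concentric circles: no straight line separates the two classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)
poly = SVC(kernel="poly", degree=2).fit(X, y)

# The linear kernel cannot do much better than chance here, while the
# RBF and degree-2 polynomial kernels separate the circles easily.
print(linear.score(X, y))
print(rbf.score(X, y))
print(poly.score(X, y))
```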
ADVANTAGES AND DISADVANTAGES OF SVMs
Advantages:
- Can solve non-linear problems (via kernels).
- Prediction time depends only on the support vectors, since the α values of all other points are zero.
- Different kernel functions make it versatile.
- Works well when the dimensionality exceeds the number of samples.

Disadvantages:
- Sensitive to outliers.
- Requires tuning and selecting the kernel and hyperparameters.
- High training time, hence scalability issues.
- Performance degrades when class overlap (noise) is high.
A note from the documentation page of scikit-learn (regarding outputs and feature importances):
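One relevant point from the scikit-learn docs is that the primal weights `coef_` are only defined for the linear kernel; for non-linear kernels the weights live in the implicit feature space, so accessing `coef_` raises an error. A sketch on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable toy data.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# coef_ (the primal weight vector w) exists only for the linear kernel.
linear = SVC(kernel="linear").fit(X, y)
print(linear.coef_)

# For rbf/poly kernels the weights live in the implicit feature space,
# so scikit-learn raises an AttributeError instead of returning coef_.
rbf = SVC(kernel="rbf").fit(X, y)
try:
    rbf.coef_
except AttributeError as e:
    print("coef_ unavailable for non-linear kernels:", e)
```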