
PCA for interviews - Eigenvectors/Lagrangian

PCA: the math behind its constrained optimisation, drawbacks and other important discussions

One of the most important mathematical concepts required to understand the working of PCA is that of eigenvectors and eigenvalues. If you know the algorithm, you are aware that in order to reduce a dataset of dimension d to a smaller dimension k, PCA calculates the eigenvectors/eigenvalues of the covariance matrix of the dataset and selects the eigenvectors corresponding to the top k eigenvalues in descending order.

But why? Why do eigenvalues and eigenvectors get involved when reducing dimensions?

What are Eigenvectors/eigenvalues?

For any square matrix A, its eigenvectors v and eigenvalues lambda satisfy the following relation:

A*v = lambda*v (with v nonzero)

Geometrically speaking, one can see that, for a given eigenvector-eigenvalue pair, the matrix multiplication amounts to a mere scaling of the eigenvector by a factor lambda. Hence its direction is unchanged.

Eigenvalues and eigenvectors are defined only for square matrices. Eigenvectors are by definition nonzero; eigenvalues may be equal to zero.
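To make the relation concrete, here is a minimal NumPy sketch (the matrix values are purely illustrative) that checks A*v = lambda*v for each eigenpair:

# Verify A*v = lambda*v for every eigenpair of a small square matrix.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])          # any square matrix

eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are the eigenvectors

for i in range(len(eigvals)):
    v, lam = eigvecs[:, i], eigvals[i]
    # A @ v is just v scaled by lam -> the direction of v is unchanged
    print(np.allclose(A @ v, lam * v))   # True for every pair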

Now let's look at another concept used in PCA.

Constrained Optimisation

Sometimes, when one needs to find the maxima/minima of an expression that is subject to a certain constraint, you cannot simply take its first derivative and set it to zero. You must take care of the given constraint.

Such constrained optimisation problems are solved using the concept of LAGRANGE MULTIPLIERS.

Note: for an equality constraint (as in PCA), Lagrange multipliers can take either sign; they are required to be non-negative only for inequality constraints (the KKT conditions).
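As a hedged illustration, here is a minimal SymPy sketch (assuming SymPy is available; the objective and constraint are made up for the example) of a classic constrained problem solved with a Lagrange multiplier. Note that one of the stationary points has a negative multiplier, which is why the sign caveat above matters:

# Maximise f(x, y) = x + y subject to g(x, y) = x**2 + y**2 - 1 = 0.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y                      # objective
g = x**2 + y**2 - 1            # constraint g = 0
L = f - lam * g                # the Lagrangian

# Setting dL/dx, dL/dy and dL/dlam to zero gives the candidate points.
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
print(stationary)
# -> x = y = +1/sqrt(2) with lam = +1/sqrt(2) (the maximum), and
#    x = y = -1/sqrt(2) with lam = -1/sqrt(2): multipliers need not be positive.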

The optimisation problem in PCA

Now let's see how the two concepts, eigenvectors/eigenvalues and Lagrange multipliers, come together to solve the PCA optimisation problem. Recall that PCA is all about finding axes of maximum variance. Below is the optimisation problem of PCA, where S is the covariance matrix of the dataset, hence a d*d square matrix (d being the original dimension of the dataset), and u is the direction along which the variance has to be maximised.

maximise u^T * S * u   subject to   u^T * u = 1
(MAXIMISING THE VARIANCE WITH A CONSTRAINT ON u)

You can see that the above is just a constrained optimisation problem. Using a Lagrange multiplier lambda, we form the Lagrangian L(u, lambda) = u^T*S*u - lambda*(u^T*u - 1); setting dL(u, lambda)/du and dL(u, lambda)/d(lambda) to zero shows that the solution has the form

S*u=lambda*u (eigenvector-eigenvalue relation)

So now you see that the solutions u are just eigenvectors of S (and the lambdas the corresponding eigenvalues). Eigenvalues in decreasing order give directions of decreasing variance; hence, to reduce d dimensions to k, we keep the eigenvectors corresponding to the top k eigenvalues.
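Putting it together, here is a minimal NumPy sketch of PCA done exactly this way (random data, purely for illustration): centre the data, build the covariance matrix S, take its eigendecomposition and keep the top-k eigenvectors:

# PCA from scratch via the covariance matrix and its eigendecomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # n=200 points, d=5 dimensions
k = 2                                    # target dimension

Xc = X - X.mean(axis=0)                  # centre the data
S = np.cov(Xc, rowvar=False)             # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric

order = np.argsort(eigvals)[::-1]        # eigenvalues in descending order
top_k = eigvecs[:, order[:k]]            # eigenvectors of the top-k eigenvalues

X_reduced = Xc @ top_k                   # the n x k projection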

This post dealt with the mathematics behind the optimisation in the PCA algorithm. Other important interview questions are related to its limitations, comparison to t-SNE, and more!

More posts:

Till then, enjoy this:


TIME COMPLEXITIES OF ML ALGORITHMS

THIS ARTICLE SUMMARISES THE TIME COMPLEXITIES OF THE MOST COMMONLY USED AND MOST FREQUENTLY ASKED-ABOUT MACHINE LEARNING MODELS.

TIME COMPLEXITIES ARE IMPORTANT TO REMEMBER IN RELATION TO QUESTIONS LIKE THE FOLLOWING:

  1. WHEN WOULD YOU CHOOSE ONE MODEL OVER THE OTHER?
  2. IS A PARTICULAR ALGORITHM GOOD FOR INTERNET-SCALE APPLICATIONS?
  3. DOES AN ALGORITHM SCALE WELL ON LARGE DATASETS/DATABASES?
  4. IS RE-TRAINING A PARTICULAR MODEL COSTLY IN TERMS OF RESOURCES?
  5. IF THE INPUT DISTRIBUTION IS CHANGING CONSTANTLY, WHICH MODEL IS PREFERABLE?

LET'S START OUR DISCUSSION ON TIME COMPLEXITIES. I WILL ASSUME YOU KNOW THE BASIC INTUITION BEHIND ALL THE ALGORITHMS, SO THIS WILL JUST BE A CRISP, TO-THE-POINT SUMMARY!

REMEMBER, YOU MIGHT OFTEN FIND DIFFERENT ANSWERS ON THE INTERNET FOR THE SAME ALGORITHM BECAUSE OF THE UNDERLYING DATA STRUCTURE OR CODE OPTIMISATION USED, OR BECAUSE BEST/WORST CASES ARE BEING STATED. THE BEST APPROACH IS TO KNOW ONE COMPLEXITY AND THE CORRESPONDING PROCEDURE IT ASSUMES.

LINEAR REGRESSION

TERMINOLOGY :

DATASET = {X, Y}, DIMENSIONS OF X = n*m, DIMENSIONS OF Y = n*1

train-time complexity = O((m^2)*(n+m))
run-time complexity = O(m)
space complexity (during run time) = O(m)

conclusion: small run-time and space complexity! good for low-latency problems.



source: https://levelup.gitconnected.com/train-test-complexity-and-space-complexity-of-linear-regression-26b604dcdfa3
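For intuition on where O((m^2)*(n+m)) comes from, here is a rough normal-equation sketch (random, illustrative data): forming X^T X costs O(n*m^2) and solving the resulting m x m system costs O(m^3):

# Closed-form linear regression via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 10
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

XtX = X.T @ X                      # O(n * m^2)
w = np.linalg.solve(XtX, X.T @ y)  # O(m^3) for the m x m system

y_hat = X[:1] @ w                  # prediction: one dot product -> O(m) time and space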

LOGISTIC REGRESSION

TERMINOLOGY:

DATASET = {X, Y}, DIMENSIONS OF X = n*m, DIMENSIONS OF Y = n*1

train-time complexity = O(n*m)
run-time complexity = O(m)
space complexity (during run time) = O(m)

conclusion: small run-time and space complexity! good for low-latency problems.
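The O(n*m) figure is essentially the cost of one full gradient pass over the data; a fixed number of epochs keeps it O(n*m) overall. A minimal sketch (illustrative data and learning rate):

# One gradient-descent epoch for logistic regression costs O(n*m).
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 10
X = rng.normal(size=(n, m))
y = rng.integers(0, 2, size=n)
w = np.zeros(m)
lr = 0.1

for _ in range(100):                       # a fixed number of epochs
    p = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid of X @ w, O(n*m)
    grad = X.T @ (p - y) / n               # gradient, O(n*m)
    w -= lr * grad

# prediction for a single point is one dot product -> O(m) time, O(m) space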

NAIVE BAYES

DATASET = {X, Y}, DIMENSIONS OF X = n*d, DIMENSIONS OF Y = n*1, NUMBER OF CLASSES = c

train-time complexity = O(n*d*c)
run-time complexity = O(d*c)
space complexity (during run time) = O(d*c)

conclusion: good for high-dimensional datasets. Used in spam detection and other simple, non-semantic NLP problems. Often considered a baseline model for comparison with more complex models.


SVM

DATASET = {X, Y}, DIMENSIONS OF X = n*d, DIMENSIONS OF Y = n*1, S = NUMBER OF SUPPORT VECTORS

train-time complexity = O(n^2)
run-time complexity = O(S*d)
space complexity (during run time) = O(S)

conclusion: high train-time complexity, but low latency and space complexity.


DECISION TREES + RF + GBDT

n = NUMBER OF SAMPLE DATA POINTS

d = THE DIMENSIONALITY

k = depth of the decision tree

p = number of nodes

train-time complexity = O(n*log(n)*d)
run-time complexity = O(k)
space complexity (during run time) = O(p)

NOW IF WE ARE TALKING OF A RANDOM FOREST (m BASE LEARNERS):

train-time complexity = O(n*log(n)*d*m)
run-time complexity = O(k*m)
space complexity (during run time) = O(p*m)

TALKING OF GBDT (m SUCCESSIVE MODELS, b = THE SHRINKAGE FACTOR; THERE ARE m SUCH COEFFICIENTS, SO IT IS ULTIMATELY A CONSTANT ADDITION THAT CAN BE IGNORED ASYMPTOTICALLY, BUT IT IS WORTH POINTING OUT IN INTERVIEWS SINCE IT IS WHAT SEPARATES GBDT FROM RF):

train-time complexity = O(n*log(n)*d*m)
run-time complexity = O(k*m)
space complexity (during run time) = O(p*m + b*m)

K-NEAREST NEIGHBOURS (brute force)

K = number of nearest neighbours

d = dimensions

(no training phase)

run-time complexity = O(n*d) (the "+K*d" term for picking the K closest is ignored asymptotically)

space complexity (during run time) = O(n*d)
 
https://stats.stackexchange.com/questions/219655/k-nn-computational-complexity
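A minimal brute-force query sketch (illustrative data) showing where O(n*d) comes from: one pass over all n stored points, each distance costing O(d):

# Brute-force KNN query: O(n*d) distance work per query, O(n*d) memory for the data.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 20, 5
X = rng.normal(size=(n, d))                 # stored training points
query = rng.normal(size=d)

dists = np.linalg.norm(X - query, axis=1)   # n distances, each O(d)
nearest = np.argsort(dists)[:k]             # indices of the k nearest neighbours
# (a partial sort such as np.argpartition keeps the selection step cheap)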

kd-tree

K = number of nearest neighbours

d = dimensions

An algorithm that builds a balanced k-d tree by pre-sorting the points in each dimension has a worst-case build complexity of O(d*n*log n). (In the usual statement this is written O(k*n*log n), where that "k" denotes the dimensionality, i.e. our d.)

run-time complexity (best case) = O(2^d * K * log n)
(log n to find the cells "near" the query point,
and roughly 2^d surrounding cells to search in that neighbourhood)

run-time complexity (worst case) = O(K*n)
space complexity = O(n*d) (the stored points themselves)

As you can see, it is not suitable for very high dimensions.
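A minimal usage sketch with SciPy's cKDTree (assuming SciPy is installed; the data is illustrative), which builds the tree once and then answers k-nearest-neighbour queries:

# Build a k-d tree and query it for the 5 nearest neighbours of one point.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))      # k-d trees shine only in low dimensions
tree = cKDTree(X)                   # build: O(n log n) for a balanced tree

dist, idx = tree.query(rng.normal(size=3), k=5)   # distances and indices of 5 neighbours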

CLUSTERING ALGORITHMS

ALGORITHM | TRAIN COMPLEXITY | SPACE COMPLEXITY | REMARK
K-MEANS | O(n*k*d*i), where n = points, d = dimensions, k = number of clusters, i = iterations | O(n*d + k*d) (the "k*d" is for the centroids) | asymptotically O(n*d) for small d and k
HIERARCHICAL CLUSTERING | O(n^3) | O(n^2) | not suitable for low-latency or low-memory problems
DBSCAN | O(n^2), but can be made O(n*log n) in lower dimensions using efficient data structures | O(n) (when d << n) | better than hierarchical in terms of complexity

SVD (SINGULAR VALUE DECOMPOSITION)

NOW THIS IS A TRICKY ONE: VARIOUS METHODS ARE AVAILABLE, AND ON THE INTERNET YOU WILL GENERALLY FIND MULTIPLE ANSWERS TO THIS QUESTION.

BELOW I STATE WHAT WIKIPEDIA STATES (IF YOU GIVE ANY OTHER ANSWER IN AN INTERVIEW, JUST BE SURE YOU UNDERSTAND THE CORRESPONDING ALGORITHM USED):

FOR AN m x n MATRIX (WITH m >= n), THE TIME COMPLEXITY IS O(m*n^2).

https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations
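A minimal NumPy sketch (illustrative shapes, with m >= n as in the statement above):

# Thin SVD of an m x n matrix and a check that A = U * diag(s) * V^T.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                      # m=100, n=20

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD, roughly O(m*n^2) here
print(np.allclose(A, (U * s) @ Vt))                 # True: the factorisation reproduces A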

PCA, t-SNE

N = NUMBER OF POINTS, D = DIMENSIONALITY

TRAIN-TIME COMPLEXITY (PCA) = O(N*D*min(N, D) + D^3)

EXPLANATION:

The complexity of computing the covariance matrix is O(N*D^2); its eigenvalue decomposition is O(D^3).

NOTICE THAT AFTER THIS WE JUST SELECT THE FIRST d (TARGET DIMENSION) EIGENVECTORS.

The main reason t-SNE is slower than PCA is that no analytical solution exists for the criterion being optimised. Instead, a solution must be approximated through gradient descent iterations.

t-SNE has a quadratic time and space complexity in the number of data points.

t-SNE requires on the order of 3N^2 memory.

https://www.geeksforgeeks.org/ml-t-distributed-stochastic-neighbor-embedding-t-sne-algorithm/
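For a feel of the difference in practice, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the data is random and purely illustrative). PCA finishes almost instantly, while t-SNE has to iterate:

# PCA (closed-form: covariance + eigendecomposition) vs t-SNE (iterative optimisation).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                 # N=1000 points, D=50 dimensions

X_pca = PCA(n_components=2).fit_transform(X)    # fast, closed-form
X_tsne = TSNE(n_components=2).fit_transform(X)  # gradient-descent iterations, noticeably slower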

SUGGEST MORE ALGORITHMS AND I WILL ADD THEM HERE!


Gaussian Naive Bayes, continuous features

There are different variants of Naive Bayes, such as Bernoulli and Gaussian. This article assumes you are familiar with the basic idea behind Naive Bayes and with how it works on categorical data.

Here we discuss one of the approaches used for handling continuous variables in Naive Bayes.

Suppose we have the following dataset, where the target variable is whether a movie will be a hit or not, and the feature variables are the action rating and the story rating (a number between 1 and 10):

ACTION RATING (AR) | STORY RATING (SR) | HIT/FLOP
7.2 | 5.8 | HIT
3.4 | 6.3 | FLOP
3.5 | 7.3 | FLOP
8.5 | 8.0 | HIT
6.9 | 2.8 | FLOP
7.0 | 5.3 | HIT
9.0 | 3.8 | HIT

NOW LET'S SUPPOSE WE HAVE A TEST POINT: ACTION RATING = 7.6, STORY RATING = 5.7. These are the quantities we need to predict:

P(HIT | AR=7.6, SR=5.7) and P(FLOP | AR=7.6, SR=5.7)

LET'S START BY CONSIDERING THE FIRST PROBABILITY EXPRESSION.

BUT NO POINT WITH EXACTLY THESE VALUES IS PRESENT IN THE DATASET, SO SHOULD WE SET THIS PROBABILITY TO ZERO? AND SIMILARLY FOR THE SECOND EXPRESSION? THAT WOULD MEAN ANY UNSEEN POINT WOULD ALWAYS DRIVE BOTH PROBABILITIES TO ZERO. SO HOW DO WE RESOLVE THIS ISSUE? LET'S GET THERE.

GAUSSIAN DISTRIBUTION

There are 3 factors that need to be evaluated on the right-hand side of the expression below:

P(HIT | AR=7.6, SR=5.7) ∝ P(AR=7.6 | HIT) * P(SR=5.7 | HIT) * P(HIT)

The P(HIT) calculation is straightforward: it is simply {number of hits / total number of hits and flops}.

To calculate the two remaining conditional probabilities, we assume that the values of each feature are sampled from a Gaussian distribution whose mean and variance are estimated from the sample points. To recall, the Gaussian probability density is:

f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2*sigma^2))

Now, once we have the Gaussian distribution for a feature column, we can get the pdf value for any point, whether or not it is present in our dataset.

IMPORTANT POINTS TO BE NOTED:

  1. While calculating P(HIT | AR=7.6, SR=5.7), the Gaussian distributions are fitted only on data points whose label is HIT.
  2. A separate distribution is fitted for every (feature column, target class) pair, so here there will be 4 distributions: AR for HIT, AR for FLOP, SR for HIT and SR for FLOP (a small worked sketch follows below).
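Here is a hand-rolled sketch of the above on the movie table (using SciPy's norm.pdf for the Gaussian density; the helper function name is my own):

# Gaussian Naive Bayes "by hand" on the HIT/FLOP table above.
import numpy as np
from scipy.stats import norm

# rows: (action rating, story rating), split by class
hit  = np.array([[7.2, 5.8], [8.5, 8.0], [7.0, 5.3], [9.0, 3.8]])
flop = np.array([[3.4, 6.3], [3.5, 7.3], [6.9, 2.8]])
x = np.array([7.6, 5.7])                           # the test point

def score(cls, prior):
    mu, sigma = cls.mean(axis=0), cls.std(axis=0)  # one Gaussian per feature
    return prior * np.prod(norm.pdf(x, mu, sigma)) # P(AR|c) * P(SR|c) * P(c)

p_hit  = score(hit,  4 / 7)
p_flop = score(flop, 3 / 7)
print("HIT" if p_hit > p_flop else "FLOP")         # compare the two unnormalised scores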

WHAT TO CHECK? THE BOX-COX TRANSFORM - AN IMPORTANT TOOL

For applying Gaussian Naive Bayes we assumed that, within any feature, the points come from a GAUSSIAN DISTRIBUTION. But what if that is not the case? Here are a few points to keep in mind:

  1. You can always plot the dist-plot and check whether the distribution is Gaussian or not.
  2. Before applying Gaussian Naive Bayes you can use the Box-Cox transform to make the distribution closer to normal (a minimal sketch follows after this list).
  3. If a column deviates hugely from a Gaussian, you can assume a different distribution instead; common alternatives are the log-normal (note that Box-Cox with lambda = 0 reduces to a log transform), power law, etc. The general expression used in the Box-Cox transform is y = (x^lambda - 1) / lambda for lambda != 0 and y = log(x) for lambda = 0; you can see how lambda = 0 turns it into a log transform.
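A minimal Box-Cox sketch with SciPy (assuming SciPy is installed; the skewed data is simulated purely for illustration and must be strictly positive):

# Box-Cox transform of a skewed feature; lambda is estimated by maximum likelihood.
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # clearly non-Gaussian feature

transformed, fitted_lambda = boxcox(skewed)   # transformed data plus the fitted lambda
print(fitted_lambda)                          # close to 0 here, i.e. essentially a log transform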

With the above points in mind you are ready to use Gaussian Naive Bayes! You can read more about the Box-Cox transform here:

More to come!


Splitting nodes in DT for continuous features (classification)

Splitting nodes in decision trees (DT) when a given feature is categorical is done using concepts like entropy, Information Gain and Gini impurity.

But when the features are continuous, how does one split the nodes of the decision tree? I assume you are familiar with the concept of entropy.

Suppose we have a training dataset of n sample points. Let us consider one particular feature, f1, which is continuous in nature.

Approach for splitting nodes

  1. We need to evaluate a candidate split at every sample value of the feature.
  2. We sort the f1 column in ascending order.
  3. Then, taking every value in f1 as a threshold, we calculate the entropy of the resulting split and the corresponding Information Gain.
  4. We select the threshold with the highest information gain and make the split there.
  5. We then continue to do the same for the leaf nodes until either max_depth is reached or a node has fewer samples than the minimum required for a split.

Let's try to understand the above with an example.

Let the following be the f1 feature column, and let's say it is a two-class classification problem:

F1 (NUMERICAL FEATURE) | TARGET VARIABLE / LABEL
5.4 | YES
2.8 | NO
3.9 | NO
8.5 | YES
7.6 | YES
5.9 | YES
6.8 | NO

WE START BY SORTING THE FEATURE VALUES IN INCREASING ORDER:

SORTED F1 | TARGET VARIABLE / LABEL
2.8 | NO
3.9 | NO
5.4 | YES
5.9 | YES
6.8 | NO
7.6 | YES
8.5 | YES
(THE SORTED FEATURE COLUMN)

NOW WE CHOOSE EACH VALUE AS A THRESHOLD ONE BY ONE: 2.8, 3.9 AND SO ON. FOR EXAMPLE, TAKING 5.4 AS THE THRESHOLD SPLITS THE DATA INTO A LEFT NODE {2.8: NO, 3.9: NO, 5.4: YES} (f1 <= 5.4) AND A RIGHT NODE {5.9: YES, 6.8: NO, 7.6: YES, 8.5: YES} (f1 > 5.4).

We perform similar splits for all the threshold values, and whichever gives us the maximum IG becomes our first splitting point. If you cannot recall what IG is: IG = Entropy(parent) - sum over children of (n_child / n_parent) * Entropy(child).

ref: Quora
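Here is a minimal sketch of the whole threshold search on the sorted column above (the entropy helper and variable names are my own):

# Pick the threshold on the sorted f1 column with the highest information gain.
import numpy as np

f1 = np.array([2.8, 3.9, 5.4, 5.9, 6.8, 7.6, 8.5])
y  = np.array(['NO', 'NO', 'YES', 'YES', 'NO', 'YES', 'YES'])

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

parent = entropy(y)
best_gain, best_t = -1.0, None
for t in f1[:-1]:                  # the largest value would leave the right node empty
    left, right = y[f1 <= t], y[f1 > t]
    gain = parent - len(left) / len(y) * entropy(left) - len(right) / len(y) * entropy(right)
    if gain > best_gain:
        best_gain, best_t = gain, t

print(best_t, best_gain)           # the threshold with maximum information gain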

Now, for further splits, a similar approach is repeated on the leaf nodes.

DISADVANTAGE

There is one disadvantage of the above process: if the dataset is large, the computation requirements increase significantly. Imagine performing the above operation on millions of records with max_depth = 10.

We can, however, mitigate the problem by feature binning, i.e. converting the numerical features into categorical ones.

More to come!


KNN and Probability (interview questions)

An article on one of the most common interview questions: how do you interpret KNN outputs?

KNN does not have a learning phase. It is a lazy algorithm that just finds the “k” nearest neighbours and performs the classification/regression task almost like a hard-coded instruction; nothing “intelligent” seems to happen. While the idea behind the algorithm is simple and straightforward, it is this simplicity that leads to many possible questions, because when one tries to solve complex real-life problems using such simple algorithms, many border cases must be considered.

Let's try to answer a few.

We know KNN can be used for classification as well as regression. We start with the problems faced during classification.

THE PROBLEM IN CLASSIFICATION

SUPPOSE YOU HAVE 4 CLASSES AND THE NUMBER OF NEAREST NEIGHBOURS (K) YOU CHOSE IS 30. FOR A CERTAIN POINT YOU GOT THE FOLLOWING RESULTS:

  1. CLASS 1 = 10 NEIGHBOURS
  2. CLASS 2 = 10 NEIGHBOURS
  3. CLASS 3 = 6 NEIGHBOURS
  4. CLASS 4 = 4 NEIGHBOURS

NOW WHAT SHOULD OUR TEST POINT BE CLASSIFIED AS? FIRST, LET'S CONSIDER WHAT THE NEIGHBOURS FROM THE 3RD AND 4TH CLASSES TELL US. CONSIDER A MEDICAL CASE, SAY CANCER DETECTION, WHICH CONSISTS OF 2 OUTPUT CLASSES, C1 AND C2.

YOU USED KNN WITH k=10 AND GOT 6 AND 4 NEIGHBOURS FROM C1 AND C2 RESPECTIVELY. JUST BECAUSE YOU HAVE MORE NEIGHBOURS OF CLASS C1, CAN YOU RULE OUT THE POSSIBILITY THAT THE SECOND TYPE OF CANCER IS PRESENT?

THE ANSWER IS NO.

In many cases you need to look at the “probabilistic” results rather than just the final label picked by the larger number of neighbours. Considering the above case of 2 classes, this is how the results would differ:

  1. If we use simple majority-vote KNN: output: cancer of class 1.
  2. If we use probability scores: output: 60% chance of cancer 1 and 40% chance of cancer 2.

Now let's get back to the 4-class problem; of course, there the maximum neighbours are from class 1 and class 2. Even using probability scores, if we need to give one final output as the solution to a business problem, how can we break the tie? In such cases a lot depends on the domain of the problem, but let's discuss 2 generic ways to break such a tie.

USING WEIGHTED KNN

FROM CLASS 1 AND CLASS 2 WE CAN SELECT THE FINAL OUTPUT BY USING WEIGHTED KNN. In weighted KNN we assign more weight to points that are closer to the test point. So instead of just counting neighbours, we weight each neighbour's vote, typically by the inverse of its distance.

ref: https://slideplayer.com/slide/4808190/
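A minimal scikit-learn sketch (assuming scikit-learn is installed; the data is random and illustrative) showing both ideas: distance-weighted voting and probability outputs instead of a bare label:

# Distance-weighted KNN with class probabilities instead of only a majority-vote label.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)            # two classes, e.g. cancer C1 vs C2

knn = KNeighborsClassifier(n_neighbors=10, weights='distance').fit(X, y)

query = np.array([[0.0, 0.0]])
print(knn.predict_proba(query))   # class probabilities, e.g. a [[0.6, 0.4]]-style output
print(knn.predict(query))         # the final weighted-vote label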

INCREASE K BY A CERTAIN VALUE AND RECHECK

We used k=30 in the case above. To break the tie, we can re-run with k=32 or 34; recounting the neighbours will usually remove the tie.

THE PROBLEM IN REGRESSION

In KNN, regression refers to returning either the average or the median of a certain continuous value associated with the nearest neighbours. The problem here is not related to probability; rather, it is related to the presence of outliers.

An outlier can distort the average, whereas the median is much more robust to such issues.
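A tiny illustration with made-up neighbour values:

# One outlier shifts the mean a lot, while the median barely moves.
import numpy as np

neighbour_values = np.array([10.0, 11.0, 9.5, 10.5, 250.0])   # 250.0 is an outlier
print(np.mean(neighbour_values))     # 58.2  -> dragged up by the outlier
print(np.median(neighbour_values))   # 10.5  -> robust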

So, in summary, for an n-class problem KNN works well only when one class clearly dominates the remaining n-1 classes among the k neighbours. And remember, because there is no such thing as “training” in KNN, there is little one can do beyond changing k if the neighbours are distributed almost uniformly across the n classes.

This article focused on one of the many problems one can face during interviews. Other problems and their solutions, like k-means++, k-d trees and more, will be discussed in subsequent posts.