### ML INTERVIEW QUESTIONS -PART 1

IN THESE INTERVIEW PREP SERIES WE LOOK AT IMPORTANT INTERVIEW QUESTIONS ASKED IN DATA SCIENTIST /ML AND DL ROLES .

IN EACH PART WE WILL DISCUSS FEW ML INTERVIEW QUESTIONS.

#### 1) What is PDF?

Probability density function (PDF) is a statistical expression that defines a probability distribution (the likelihood of an outcome) for a continuous random variable. PDF for an interval indicates the probability of the random variable falling within the interval.

#### 2) What is Confidence Interval?

confidence interval displays the probability that a parameter will fall between a pair of values around the meanConfidence intervals measure the degree of uncertainty or certainty in a sampling method.

#### 3) Can KL divergence be used as a distance measure?

No. It is not a metric measure as it is not symmetric.

#### 4) What is Log-normal distribution?

In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.

#### 5) What is Spearman Rank Correlation Coefficient?

Spearman Rank Correlation Coefficient is determined by applying Pearson Coefficient on rank encoded random variables.

#### 6) Why is “Naive” Bayes naive?

The conditional independence of the variables of a data frame is an assumption in Naive Bayes which can never be true in practice. The conditional independence assumption is made to simplify the computations of the conditional probabilities. Naive Bayes is naive due to this assumption.

#### 7)What is the “Crowding problem” in t-sne?

This happens when the datapoints are distributed in a region on a high-dimensional manifold around i, and we try to model the pairwise distances from i to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map.

#### 8)What are the limitations of PCA?

PCA should be used mainly for variables which are strongly correlated.

If the relationship is weak between variables, PCA does not work well to reduce data. Refer to the correlation matrix to determine.

PCA Results Are Difficult To Interpret Clearly.

#### 9)Name 2 failure cases of KNN?

When query point is an outlier or when the data is extremely random and has no information.

#### 10) Name 4 assumptions of linear regression

• Linear relationship
• Multivariate normality
• No or little multicollinearity
• No auto-correlation
• Homoscedasticity

#### 11)Why are log probabilities used in Naive -bayes?

The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.

#### 12)How to handle Numerical features in(Gaussian NB)?

Numerical features are assumed to be Gaussian. Probabilities are determined by considering the distribution of the data points belonging to different classes separately.

#### 13)How do you get a feature important in naive Bayes?

The naive bayes classifers don’t offer an intrinsic method to evaluate feature importances. Naïve Bayes methods work by determining the conditional and unconditional probabilities associated with the features and predict the class with the highest probability.

#### 14)Differentiate between GD and SGD.

n both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

In SGD only one data point is used per iteration to calculate the value of the loss function. While for GD all the data points are used to calculate the value of the loss function

#### 15)Do you know the train and run time complexities for a SVM model?

Train time complexity O(n2)

Run time complexity O(k*d)

k=number of support vectors, d=dimensionality of data set

#### 16)Why is RBF kernel SVM compared to kNN?

They are not that similar, but they are related though. The point is, that both kNN and RBF are non-parametric methods to estimate the density of probability of your data.

Notice that this two algorithm approach the same problem differently: kernel methods fix the size of the neighborhood (h) and then calculate K, whereas kNN fixes the number of points, K, and then determines the region in space which contain those points.

#### 17)What decides overfitting and underfitting in DT?

the max_depth parameter decides the overfitting and underfitting in Decision Trees.

#### 18)What is Non negative Matrix Factorization?

decomposing a matrix into 2 smaller matrices with all elements greater than zero and whose product gives us the original matrix.

#### 19)what is Netflix prize problem?

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.

#### 20) What are word embeddings?

Word embeddings are a type of word representation IN A VECTOR SPACE that allows words with similar meaning to have a similar representation.

### THE MATH BEHIND SVMs , Langrangian ,kernel

In this article we discuss the important mathematical tools used in formulation of SVMs and also point out the important questions asked in interviews .

SVMs are one of those algorithms which people use as a black box , but interviewers might end up asking the details of the mathematics involved And in general its always better to know the math behind any algorithm.

### THE PROBLEM STATEMENT

Classification: THE GOAL OF SVM IS TO FIND A HYPERPLANE IN THE d DIMENSIONAL VECTORSPACE( dimensionality of the data set) WHICH SEPARATES THE DATA WITH “MOST MARGIN”, OR IN OTHER WORDS “OPTIMALLY” ( as multiple hyperplanes can separate the points , we desire the best out of all those”

QUESTION-WHAT IS HARD MARGIN SVM?

IN HARD MARGIN SVM WE HAVE ZERO TOLERANCE FOR ANY ERROR , IT TRIES TO FIND THE OPTIMAL HYPERPLANE ASSUMING ALL POINTS WILL BE CORRECTLY CLASSIFIED AT THE END .

#### OPTIMISATION STATEMENT

SO WHAT ARE WE TRY TO OPTIMIZE IN SVMs?

lets consider a point whose actual class is y( -1 or 1) , lets consider the predicted value for that point is (w.x+b) . note that there are 4 cases possible :

1. the product y*(w.x+b) is positive because y is “+1” and predicted distance from plane is also positive.
2. the product y*(w.x+b) is positive because y is “-1” and predicted distance from plane is also negative.
3. the product y*(w.x+b) is negative because y is “+1” and predicted distance from plane is negative.
4. the product y*(w.x+b) is negative because y is “-1” and predicted distance from plane is positive.

you can see that everytime there is a wrong classification we see this product is negative. and every correct classification means positive product .

So we can start by saying that our optimal hyperparameter will try to maximize the product to positive for all the points , from now we will use w/||w|| to introduce scale invariancy in the hyperplane.

now a bit tricky statememt , ” the most optimal hyperplane is that whose min. value of the above product( which will occur for the closest point) should be largest among all hyperplanes ” . In layman terms , the closest point (whose distance is called margin ) should be maximised . So the following is our optimisation problem :

where m is the margin , (distance from the closest point) . the hyperplane with the max M will be selected.

where gamma is the product function discussed above. now we try to tweak the above expression. remember from your class 12th 3-d geometry classes that scaling the w and b wont change the final optimal hyperplane , so if we rescale the w and b to make the min product=1( but why? so that our optimization problem can be reduced to a single variable ||w|| ! ). Now we will be left only with

max(w,b) 1/||w|| subject to the same conditions as above , which is same as min(w,b) ||w|| .

now minimizing ||w|| and (1/2)||w||2 would be the same problem ( to make further mathematics easier) . Hence our optimization problem boils down to:

lets solve this hard margin classifier .

## CONSTRAINED OPTIMIZATION

QUESTION: HOW ARE CONSTRAINED OPTIMISATION PROBLEMS DIFFERENT THAN SIMPLE OPTIMISATION PROBLEMS ?

CONSTRAINED OPTIMISATION PROBLEMS ARE SOLVED USING THE CONCEPT OF LANGRANGE MULTIPLIERS.

## THE DUAL FORM

SUBSTITUTING IN THE LANGRANGE EQUATION ABOVE YOU WILL HAVE :

WHICH REDUCES OUR DUAL PRBLEM TO :

SEE HOW NOW THIS PROBLEM DEPENDS ONLY ON THE LANGRANGE MULTIPLIERS . THE WAY THE ABOVE PROBLEM IS SOLVED WHEN YOU USE LIBRARIES IS PYTHON IS BY USING SOMETHING KNOWN AS SMO( Sequential minimal optimization algorithm) libraries like libsvm use this algorithm.

we know that in real life seldom do we have completely linearly separable data points. So we must have a variation which allows a little tolerance on errors . And of course this means some mathematical changes in our optimization problem . This change will also let us have a control on how much error is tolerated and hence will act as a regularization parameter for our objective function.

Imagine you have a strict dad who says you must score 100 in your maths paper (hard margin ). Somehow you are able to make that puppy face and hit his soft corner .He says that he will consider anything above 95 good . S o basically what you did was add a tolerance of 5 , in the same way in our soft-margin classifier we introduce a slack variable zeta for every point :

1. but theoretically we can always choose large enough zetas to make our solution work . to prevent that from happening we can penalize large zeta values .
2. we also must ensure that zetas are all positive.
3. we add one more parameter C (regularization parameter) to keep a check on how important the zetas are
4. keeping all the above points in mind we get to the following expression .

### PRIMAL FORM

AND THE CORRESPONDING DUAL FORM AFTER APPLYING LANGRANGE MULTIPLIERS :

THE PARAMETER C SHOWS THE IMPORTANCE OF ZETAS . THE IMPORTANCE OF ZETAS IS INVERSELY PROPORTIONAL TO THE VALUE OF C . INFINITE POSITIVE C MEANS HARD MARGIN CLASSIFIER. REMEMBER THIS AND IT WILL HELP IN INTERVIEWS .

## SOLVING NON-LINEAR PROBLEMS with SVMs

THE KERNEL TRICK ENABLES US TO PROJECT THE POINTS INTO HIGHER DIMENSION AND HENCE MAKE THE POINTS WHICH WERE SEEMINGLE UNSEPARABLE BY A HYPERPLANE IN LOWER DIMENSION BECOME SEPARABLE IN HIGHER DIMENSIONS:

MATHEMATICALLY IN THE DUAL FORM WE TRY TO DEFINE A KERNEL FUNCTION K WHIC WHICH CALCULATES THE DOT PRODUCT xi.xj in higher dimensions. hence our dual function becomes:

there are different varieties of kernel functions available .

1. the linear kernel
2. the polynomial kernel
3. THE RBF KERNEL

1. CAN SOLVE NON LINEAR PROBLEMS .
2. RUN TIME DEPENDS SOLELY ON SUPPORT VECTORS AS ALPHA VALUES FOR OTHER POINTS ARE ZERO
3. DIFFERENT KERNEL FUNCTIONS MAKE IT VERSATILE.
4. WORKS WELL IF DIMENSIONALITY IS MORE THAN NUMBER OF SAMPLES.
5. PRONE TO OUTLIERS.

1. TUNING AND SELECTING KERNEL AND HYPERPARAMETRS.
2. HIGH TRAINING TIME . HENCE SCALIBILITY ISSUES .
3. PERFORMANCE DEGRADES IF OVERLAPPING CLASSES IS MORE(NOISE)

A NOTE FROM THE DOCUMENTATION PAGE OF SCIKITLEARN (REGARDING OUTPUT ANF FEATURE IMPORATANCES:

## DESCRIBING POLICIES AND VALUE FUNCTIONS OF A MARKOV DECISION PROCESS

THE LAST ARTICLE (FINITE MARKOV DECISION PROCESS -PART 1 ) WAS FOCUSED MAJORLY ON DESCRIBING WHAT WE MEAN BY RETURNS, AN ENVIRONMENT , AND AN ENVIRONMENT STATE FOLLOWING THE MARKOV PROPERTY . SO WHATS NEXT . NOW WE GET INTO THE MATHEMATICS OF POLICIES AND VALUE FUNCTIONS . WE SEE HOW TO DEFINE THESE MATHEMATICALLY AND THEN SEE THE ASSOCIATED BELLMAN EQUATIONS . LATER WE SEE THE OPTIMISED POLICIES AND BELLMAN FUNCTIONS AND HOW THEY ULTIMATELY LEAD US TO THE FINAL FUNCTION WE ARE LOOKING FOR . THEN WE SEE WHAT PROBLEMS SUCH SOLUTIONS PRESENT AND HOW WE CAN APPROXIMATE THE FUNCTION TO MEET OUR DEMANDS . IF YOU ARE NOT FAMILIAR WITH THE TERMINOLOGIES I WILL RECOMMEND REFERRING THE PREVIOUS TWO ARTICLES .

## POLICIES AND VALUE FUNCTIONS

THE VERY PURPOSE OF ANY REINFORCEMENT PROBLEM IS TO CALCULATE VALUE FUNCTIONS RELATED TO A STATE OR A STATE ACTION PAIR . WHICH IS A MEASURE OF HOW “GOOD”/”BAD ” IT IS FOR AN AGENT IF IT WERE TO START FROM THE PARTICULAR STATE . MATHEMATICALLY THIS GOOD/BAD HAS TO DO WITH WHAT FUTURE REWARDS ONE CAN EXPECT STARTING FROM A GIVEN STATE . THE POLICY FUNCTIONS ARE DEFINED USING WHAT ARE CALLED POLICIES . WE DEFINED POLICIES IN THE FIRST ARTICLE OF REINFORCEMENT LEARNING AS A MAPPING OF STATES TO PROBABILITIES OF SELECTING EACH POSSIBLE ACTION . BELOW IS A FORMAL DEFINITION OF A POLICY , TAKEN FROM AN INTRODUCTION TO REINFORCEMENT LEARNING BY BARTON AND SUTTON :

## THE MATHEMATICS OF VALUE FUNCTIONS

SUPPOSE THE AGENT IS IN A STATE St . THE VALUE FUNCTION FOR THE STATE IS THE EXPECTED RETURN VALUE , GIVEN THE AGENT FOLLOWS A CERTAIN POLICY . MATHEMATICALLY THE VALUE FUNCTION FOR A MARKOV DECISION PROCESS IS :

## E REFERS TO THE EXPECTATION VALUE , Gt IS THE RETURN FUNCTION

THE V pi(s) IS KNOWN AS THE STATE VALUE FUNCTION FOR A POLICY pi . WE DEFINE YET ANOTHER TYPE OF VALUE FUNCTION THAT INCORPORATES THE ACTION TAKEN FROM A CERTAIN STATE FOLLOWING A PARTICULAR POLICY . THIS TYPE OF VALUE FUNCTION KNOWN AS ACTION VALUE FUNCTION , IS DEFINED AS :

## PROPERTY OF VALUE FUNCTIONS

WE INTRODUCE AN IMPORTANT FEATURE OF A VALUE FUNCTION WHICH WILL BE USED IN THE ENTIRE REINFORCEMENT LEARNING ARTICLES AND IN DYNAMIC PROGRAMMING . AND THIS FEATURE IS THE RECURSIVE NATURE OF VALUE FUNCTIONS . SO FOR ANY STATE St AND THE ONES THAT COME IN FUTURE BY FOLLOWING A POLICY (pi) THE FOLLOWING CONDITION STANDS TRUE AS A RESULT OF THE RECURSIVE NATURE :

THE ABOVE EQUATION IS THE BELLMAN EQUATION WE MENTIONED IN THE BEGINNING OF THIS ARTICLE . NOTICE HOW WE STARTED WITH THE STATE VALUE FUNCTION , REPRESENTED THE RETURN FUNCTION AS A FUNCTION OF RETURN FUNCTION AT A FUTURE STEP AND THEN JUST USED THE FACT WE STATED IN THE PREVIOUS ARTICLE . THE BELLMAN EQUATION IS JUST THE PRODUCT OF ALL POLICIES FROM S AND THE SUM OF ALL PROBABILITIES OF TRANSITIONING USING ACTIONS AND GAINING REWARDS . WE KNOW USE THIS RECURSIVE BELLMAN EQUATION TO FIGURE OUT THE OPTIMAL POLICIES AND VALUE FUNCTIONS .

## OPTIMAL POLICIES AND VALUE FUNCTIONS

NOW OF ALL THE POLICIES AND VALUE FUNCTIONS WE WANT THE BEST ONES!! . THE BEST OPTIMAL POLICIES AND FUNCTIONS ARE JUST THE ONES THAT HAVE MAXIMUM VALUES AND HENCE GIVE MAXIMUM RETURN IN THE LONG RUN . THIS IS EXACTLY WHAT AN AGENT DESIRES . NOW WE LOOK AT THE TWO OPTIMAL EQUATIONS , PRACTICALLY THEY ARE NOTHING DIFFERENT FROM THE PREVIOUS ONES , WE JUST HAVE ADDED A “MAX” FUNCTION TO THEM :

### THE OPTIMAL ACTION VALUE FUNCTION . THIS IS WHAT THE AGENT ULTIMATELY HAS TO FIND OUT .

THE ABOVE BELLMAN EQUATION IS THE ONE WE WERE LOOKING FOR !!! BUT…………

THIS COMES AT A COST! THEORETICALLY WE HAVE THE ANSWER . BUT IN REALITY THIS EQUATION IS VERY DIFFICULT TO SOLVE . MEMORY IS A BIG CONSTRAINT IN COMPUTATION AS THE DIMENSIONS INCREASE AND THE NUMBER OF ALL POSSIBLE STATES SHOOT UP EXPONENTIALLY . REMEMBER USING DYNAMIC PROGRAMMING / LOOK UP TABLES IS THE WAY BUT EVEN SMALL GAMES LIKE CHESS BECOME HUGE IN TERMS OF COMPUTATION AND MEMORY AS THE DIMENSIONS INCREASE .

HENCE WE HAVE TO SETTLE FOR APPROXIMATIONS . MOREOVER MARKOV DECISION MAKING BECOMES CUMBERSOME ONCE DIMENSIONALITY INCREASES . IN THE NEXT ARTICLE WE WILL EXPLORE AND TRY TO SOLVE A PROBLEM USING MARKOV PROPERTY . THEN WE WILL HEAD TO MONTE CARLO METHODS OF SOLVING SUCH PROBLEMS .

HAPPY LEARNING!!! MAKE SOME MEMORIES , WE MIGHT NEED THEM!!

## DESCRIBING THE MARKOV DECISION PROCESS IN MACHINE LEARNING

IN THIS POST WE DISCUSS A CERTAIN TYPE OF REINFORCEMENT PROBLEM . WE DEFINE A MERKONIKOV PROCESS AND THEN WE GET TO THE MATHS . BUT AS ALWAYS FIRST WE NEED TO UNDERSTAND A FEW DEFINITIONS . THEN WE TRY TO RELATE WITH A REAL LIFE PROBLEM . WE HAVE ALREADY SEEN HOW AN AGENT INTERACTS WITH ITS ENVIRONMENT AND TRIES TO MAXIMISE HIS LONG TERM GAINS . FOR ANY PROBLEM WE REPRESENT “

1. THE SET OF ALL POSSIBLE STATES BY S AND AND THE STATE AT TIME t BY St
2. THE SET OF ALL POSSIBLE ACTIONS BY A AND THE STATE AT TIME t BY At
3. THE POLICY MAPPING IN A STATE S TO AN ACTION A BY pi(a|s)

## RETURNS

WE HAVE SEEN THAT THE GOAL OF AN AGENT IS TO MAXIMISE THE RETURNS . RETURNS REFER TO THE LONG TERM REWARDS STARTING FROM A PARTICULAR STATE . SO HOW TO REPRESENT IT MATHEMATICALLY. AS STATED IT MUST BE SOME FUNCTION OF PRESENT AND FUTURE REWARDS

SO WHAT IS A MARKOV PROCESS ? WELL WE NEED TO CONSIDER TWO THINGS HERE . FIRST THAT THE AGENT KNOWS WELL WHAT THE ENVIRONMENT EXACTLY IS . SECOND IF AT ANY TIME t , THE AGENT IS IN A STATE S , THE EXPECTED RETURNS FROM THERE ONWARDS DEPENDS ONLY ON THAT STATE AND THE ACTIONS TAKEN FROM THERE . THAT IS HOW YOU ENDED UP IN THAT STATE WON’T EFFECT THE FUTURE RETURNS . THIS IS HOW IT LOOKS LIKE :

## THE DISCOUNT PARAMETER

NOTICE THE PARAMETER GAMMA AND ITS SUBSEQUENT POWERS IN THE EXPRESSION? GAMMA IS REFERRED AS THE “DISCOUNT ” PARAMETER . IT IS A MEASURE THAT ALLOWS US TO SHOW THE DIMINISHING IMPORTANCE OF FUTURE REWARDS . A SMALL GAMMA WOULD TEND TOWARDS A GREEDY APPROACH WHERE THE RETURN FUNCTION FOCUSES ON THE MORE IMMEDIATE REWARDS . IF GAMMA =0 , THE AGENT IS SAID TO BE “MYOPIC ” THAT IS , FOCUSED ON JUST THE IMMEDIATE REWARD . SECONDLY GAMMA HELPS IN CONVERGENCE OF THE RETURN FUNCTION IF THE NUMBER OF TOTAL ALLOWED STEPS /ACTIONS IS NOT PREDEFINED . IN SUCH CASES , IN ABSENSE OF GAMMA , THE RETURN FUNCTION WOULD TURN INTO AN INFINITE SUM .

## THE MARKOV PROPERTY

A MARKONIKOV PROPERTY PROPERTY COMES UNDER A MORE GENERAL DEFINITION OF AN ENVIRONMENT . LETS DISCUSS WHAT DO WE MEAN BY A GENERAL ENVIRONMENT . SUPPOSE THE AGENT IS CURRENTLY IN A STATE St .

NOW IN A MORE GENERAL SENSE WHAT HAPPENS NEXT WOULD DEPEND IN HOW YOU ACTUALLY MADE UP TO THAT PARTICULAR STATE . THAT IS WHAT HAPPENS IS NOT ONLY A FUNCTION OF THE PRESENT STATE , RATHER THAN DEPENDING ON THE WHOLE CUMULATIVE PAST .

A GENERALISED REAL LIFE ENVIRONMENT

SUPPOSE YOU QUALIFIED THE FIRST LEVEL OF A MCQ BASED EXAM . LET SAY THE PREPARATION FOR THE SECOND ROUND IS THE PRESENT STATE St . NOW WHETHER YOU WOULD EXCEL IN THE SECOND ROUND WOULD BE DEPENDENT ON HOW YOU PASSED THE FIRST ROUND . IF HALF OF THE QUESTIONS WERE A GUESS WORK AND YOU WERE LUCKY ENOUGH TO GET PAST THE CUT-OFF SCORE YOUR CHANCES OF QUALIFYING THE NEXT ROUND IS LOW . ON THE OTHER HAND IF YOU SOLVED EVERY QUESTION THEN YOUR CHANCES ARE HIGH .

LETS REPRESENT THE ABOVE STATEMENTS MATHEMATICALLY , SUPPOSE THE HISTORY OF REACHING A STATE St IS REPRESENTED BY ((S0 ,A0, R1) ,(S1 ,A1, R2) ………..(S(t-1) , A(t-1), R(t)) . FOR A GENERAL THE FUTURE RETURNS WOULD DEPEND ON THIS HISTORY . (SO IN THE ABOVE EXAM CASE HOW MANY ACTIONS WERE GUESS WORK AND HOW MANY WERE SOLVING AND ANSWERING )

NOW LETS DISCUSS A SPECIAL CASE OF THE ABOVE GENERAL ENVIRONMENT . SUPPOSE YOU ARE PLAYING A GAME OF CHESS , FOR A PARTICULAR STATE THE FUTURE DEPENDS SOLELY ON HOW YOU HAVE REACHED THERE , AND FROM THERE ONLY THE NEXT MOVES MATTER . ANOTHER EXAMPLE IS SUPPOSE YOU HAVE TO REACH MUMBAI FROM LUCKNOW VIA DELHI . NOW ONCE YOU HAVE REACHED DELHI AIRPORT . OPTIMISING THE TIME TAKEN TO REACH MUMBAI WONT DEPEND ON HOW YOU REACHED DELHI . USING THE ABOVE INTUITION WE DEFINE BELOW WHAT IS KNOWN AS A MARKOV PROCESS .

## THE MATH

WE DEFINE THE PROBABILITY AT STEP t OF TRANSITIONING TO A STATE (s’) FROM (s) BY TAKING AN ACTION (a ) AND GETTING A REWARD (r )

#### NOTICE HOW THE PROBABILITY DEPENDS ONLY ON THE PRESENT STATE AND THE NEXT ACTION

SO NOW WE KNOW WHAT A STATE WHICH FOLLOW MARKOV PROPERTY MEANS . NEXT WE DISCUSS POLICY FUNCTIONS AND VALUE FUNCTIONS WITH RESPECT TO MARKOV DECISION MAKING PROCESSES . WE WILL SEE WHAT DRIVES A REINFORCEMENT LEARNING AGENT AND WE WILL BE INTRODUCED TO ONE OF THE MOST VITAL EQUATIONS IN REINFORCEMENT LEARNING , THE BELLMAN EQUATION.

HAPPY LEARNING!!!

## THE WHAT AND HOW OF REINFORCEMENT LEARNING

APART FROM SUPERVISED AND UNSUPERVISED MACHINE LEARNING , THERE IS A THIRD KIND OF MACHINE LEARNING APPROACH , WIDELY KNOWN AS REINFORCEMENT LEARNING . SO HOW IS THIS APPROACH DIFFERENT FROM THE FORMER TWO . HERE IN THIS ARTICLE WE DISCUSS THE OVER VIEW OF REINFORCEMENT LEARNING ,THE IMPORTANT TOPICS AND THE BASIC APPROACH . THE BASIC DIFFERENCE BETWEEN THE SUPERVISED/UNSUPERVISED AND THE REINFORCEMENT LEARNING APPROACH IS THE ABSENSE OF A PREDEFINED /AVAILABLE DATA SET ON WHICH WE JUST HAVE TO FIT OUR MODEL .

THERE ARE NO CLEAR INSTRUCTIONS /ALGORITHMS THAT STATE EXACTLY WHAT EACH ITERATION HAS TO DO. TO MAKE THIS POINT CLEAR LETS TRY TO VISUALIZE A SINGLE REAL LIFE PROBLEM WITH A SUPERVISED APPROACH AND A REINFORCEMENT LEARNING APPROACH . SUPPOSE A CHILD WANTS TO LEARN HOW TO RIDE A BICYCLE . HOW WOULD A SUPERVISED LEARNING APPROACH LOOK LIKE :

1. TELLING HIM THE EXACT AMOUNT OF FORCE HE HAS TO PUT ON THE PEDAL FOR EVERY POSITION OF THE PEDAL.
2. WATCHING THOUSANDS OF VIDEOS OF PEOPLE CYCLING AND TRYING TO LEARN FROM IT .
3. A LOSS FUNCTION THAT SHOWS HOW MUCH THE CENTRE OF GRAVITY IS AWAY FROM THE IDEAL POSITION .

I’M SURE YOU DIDN’T LEARN CYCLING THIS WAY . (AND NO LUCKY CHILD EVER WILL!!!) . IN FACT THE ONLY THING THAT COMES INTO THE MIND AS A CHILD WHILE LEARNING TO CYCLE IS ” I DON’T HAVE TO FALL ” . AND HOW DO YOU LEARN ?

1. YOU BEGIN BY KNOWING NOTHING EXCEPT THE FACT THAT WHAT YOU HAVE TO DO . IT INCLUDES THE GOAL AND THE FACT THAT PEDDLING AND BALANCING ARE THE ONLY THINGS YOU CAN DO .
2. TRYING A FEW TIMES , MAYBE FALLING .
3. AFTER FALLING YOU AVOID WHAT YOU DID THAT RESULTED IN A FALL .
4. AND AFTER TRYING A NUMBER OF TIMES YOU SUCCEED .

### THE BASIC DIFFERENCE

ISN’T IT STRANGE THAT EVEN AFTER A CHILD MASTERS HOW TO RIDE HIS CYCLE HE HAS NO IDEA ABOUT THE RULES THAT GOVERN THE MECHANICS /DYNAMICS OF THE CYCLE . YET HE LEARNT HOW TO BALANCE THE FORCES, THE TORQUES AND HOW TO SHIFT HIS CENTRE OF GRAVITY WHILE MAKING A TURN. THIS IS THE PRINCIPLE BEHIND REINFORCEMENT LEARNING . RATHER THAN HAVING A PREDEFINED SET OF RULES IT IS RATHER ” EXPLORATORY ” .

THE “AGENT ” – THE ONE WHO IS TRYING TO REACH THE GOAL , TRIES ALL THE “ACTIONS” THAT HE CAN PERFORM IN THE “ENVIRONMENT ” AVAILABLE TO HIM , THE ENVIRONMENT MIGHT BE COMPLETELY “KNOWN” OR HE MIGHT FACE “UNKNOWN ” REGIONS . LATER EVEN AFTER HE LEARNS A WAY THAT TAKES HIM TO THE GOAL HE TRIES TO “EXPLORE ” NEW WAYS TO FIND IF BETTER WAYS EXIST . LOOK AT THESE TWO CHARACTERISTICS OF REINFORCEMENT LEARNING :

THIS “EXPLORATION ” IS ONE OF THE CHARACTERISTIC DIFFERENCES BETWEEN REINFORCEMENT LEARNING AND OTHER FORMS OF MACHINE LEARNING . THERE IS TRADE OFF BETWEEN EXPLOITATION(GREEDY APPROACH) OR EXPLORATION .

IN REINFORCEMENT LEARNING EXPLICITLY CONSIDERS THE WHOLE PROBLEM OF A GOAL DIRECTED AGENT , IN CONTRAST TO OTHER LEARNING APPROACHES WHICH TEND TO DIVIDE THE PROBLEM INTO SUB PROBLEMS AND TRY TO OPTIMISE THEM WITHOUT WORRYING ABOUT THE LARGER PICTURE .

## TOWARDS THE MATH

WE START BY ADDRESSING THE BASIC TERMINOLOGIES THAT WE WILL USE :

### POLICY

A POLICY IS HOW THE LEARNING AGENT BEHAVES AT A GIVEN TIME/STATE IN THE ENVIRONMENT. IN OTHER WORDS POLICY IS A MAPPING FROM A STATE TO THE ACTIONS TO BE TAKEN WHEN IN THOSE STATES . IT MAY BE A FUNCTION , A LOOK UP TABLE . IN SIMPLE WORDS ITS THE WAYS YOU CAN TAKE ACTION WHEN YOU FACE A SITUATION . FOR EXAMPLE IF A TIGER SUDDENLY APPEARS IN YOUR ROOM, RUNNING AND SHOUTING IS A POSSIBLE POLICY , WHILE GREETING THE TIGER IS NOT (JUST DON’T, NO).

### REWARD SIGNAL

IN REINFORCEMENT LEARNING A ” GOAL ” IS MATHEMATICALLY REPRESENTED AS A REWARD SIGNAL . ON EVERY ACTION TAKEN , THE ENVIRONMENT SENDS A NUMBER , CALLED A REWARD TO SHOW WHETHER THAT PARTICULAR ACTION WORSEN OR MADE THE SITUATION BETTER . THE AGENTS GOAL IS TO MAXIMISE HIS REWARDS OVER THE LONG RUN . A PROFITABLE ACTION(WINNING ) GIVES YOU A LARGE POSITIVE REWARD , GETTING INTO TROUBLE (UNDESIRED STATES) A LARGE NEGATIVE , WHILE TAKING STEPS AND YIELDING NOTHING MIGHT BE A SMALL NEGATIVE REWARD( AS TO OPTIMISE REDUNDANT ACTIONS).

### VALUE FUNCTION

REWARD SIGNAL SHOWS HOW GOOD / BAD AN ACTION IS IN AN IMMEDIATE SENSE , WHILE VALUE FUNCTION SHOWS A PICTURE IN THE LONG RUN . SO VALUE FUNCTION IS THE AMOUNT OF REWARD AN AGENT CAN EXPECT TO RECIEVE IN THE FUTURE STARTING FROM A PARTICULAR STATE . TO UNDERSTAND THIS SUPPOSE A GUY GETS SOME MONEY TO SPEND . IMMEDIATE “REWARDS” LIKE EXPENSIVE WATCHES , PARTIES MIGHT SEEM TO BE GOOD IN THE IMMEDIATE SENSE BUT SAVING NOW AND INVESTING WILL INCREASE HIS MONEY AND WILL BE MORE REWARDING IN THE LONG RUN .

### EXPLOITATION AND EXPLORATION

EXPLOITATION REFERS TO SELECTING THE BEST ALTERNATIVE FROM THE POLICIES WHEN PRESENT IN A PARTICULAR STATE . EXPLORATION REFERS TO RANDOMLY SELECTING ONE OF THE ACTIONS (EXPLORING) IN HOPE TO FIND BETTER REWARDS IN THE LONG RUN . THIS IS ONE OF THE STRIKING CHARACTERISITCS OF REINFORCEMENT LEARNING .

### RETURNS

THE CUMULATIVE REWARD FUNCTION THAT THE AGENT HAS TO MAXIMIZE.

NOW LETS HAVE A LOOK AS TO HOW THE AGENT AND ITS ENVIRONMENT INTERACTS :

### THE AGENT IN A STATE PERFORMS AN ACTION INTERACTING WITH THE ENVIRONMENT ,PRODUCING A NEW STATE AND A REWARD

ABOVE IS THE BASIC STRUCTURE OF HOW AN AGENT INTERACTS WITH THE ENVIRONMENT . FURTHER REINFORCEMENT LEARNING PROBLEMS CAN BE CATEGORISED INTO MANY CATEGORIES . WE WILL DISCUSS THE THREE MAJOR VARIETIES OF REINFORCEMENT LEARNING :

THE FINITE MARKOV DECISION PROCESSES WHICH WILL INCLUDE DISCUSSING VALUE FUNCTIONS , BELLMAN EQUATIONS , THE MARKOV PROPERTY AND RETURNS . FOLLOWED BY MARKOV DECISION PROCESS( MDP ) WE WILL DISCUSS THE MONTE CARLO METHODS , MONTE CARLO ESTIMATION OF ACTION VALUES , OFF POLICY AND ON POLICY APPROACHES AND MONTE CARLO CONTROL .

FOLLOWED BY MONTE CARLO WE WILL DISCUSS TEMPORAL DIFFERENCE LEARNING . WE WILL LOOK AT REAL LIFE PROBLEMS AND COMPARE THESE THREE TYPES OF LEARNING . THE MATHEMATICS MOSTLY USES PROBABILITIES AND MATRICES FOR LOOK UP /SEARCH TABLES .

### APPLICATIONS

CURRENTLY REINFORCEMENT LEARNING IS BEING EXTENSIVELY RESEARCHED FOR DEVELOPING MACHINES WHICH LEARN TO PLAY AND WIN AGAINST THE BEST PLAYERS OF THE WORLD . IN 2016, GOOGLE DEEP MIND TRAINED A REINFORCEMENT LEARNING MODEL “ALPHA ZERO ” WHICH LEARNED TO PLAY “GO” . GO IS ONE OF THE MOST COMPUTATIONALLY DIFFICULT GAMES IN THE WORLD BECAUSE THE NUMBER OF POSSIBLE STATES IS EVEN MORE THAN THE NUMBER OF ATOMS IN THE UNIVERSE . FOR A GO BOARD OF SIZE (N*N) . THE NUMBER OF POSSIBLE STATES IS ( 3 POW(N SQUARED)) . IMAGINE THE NUMBER OF POSSIBLE STATES FOR A (19 *19) BOARD. SUCH IS “THE CURSE OF DIMENSIONALITY .”

## THE K MEANS CLUSTERING CLASSIFICATION ALGORITHM USED IN MACHINE LEARNING

WE HAVE SEEN HOW CLASSIFICATION PROBLEMS ARE TACKLED USING LOGISTIC REGRESSIONS . HERE WE DISCUSS AN ALGORITHM THAT HELPS US TO CLASSIFY THINGS INTO MULTI -CLASSES . THE INTERESTING PART IS THAT THERE ARE NO LABELLED TAGS ASSOCIATED WITH THE DATA POINTS TO TELL US TO WHICH CLASS A CERTAIN DATA INSTANCE BELONGS(K MEANS CLUSTERING IN MACHINE LEARNING NOT TO BE CONFUSED WITH K NEAREST NEIGHBOURS WHERE WE NEED LABELED DATA) MAKING IT AN UNSUPERVISED MACHINE LEARNING PROBLEM . LETS MAKE THIS POINT CLEAR BY CONSIDERING A REAL LIFE EXAMPLE WHERE WE AS HUMANS HAVE CLASSIFIED NUMEROUS UNLABELLED DATA. ANY LIVING CREATURE IS CLASSIFIED AS AN ANIMAL OR A PLANT .FURTHER WE ASSOCIATE THOUSANDS OF FEATURES TO MAKE CLASSIFICATIONS LIKE THE KINGDOM , CLASS , ORDER AND FAMILY . BUT NOTICE HOW NO ANIMAL HAS A TAG ON IT SAYING I BELONG TO SO AND SO CATEGORY . SO WHEN WE ENCOUNTER A NEW SPECIES HOW DO WE DECIDE AS TO WHICH CLASS THEY BELONG TO .

MOREOVER WHAT IS THE LEVEL OF CLASSIFICATION REQUIRED DEPENDS ON THE PROBLEM STATEMENT . SOMEONE MIGHT BE INTERESTED IN FULL ROOT LEVELS OF CLASSIFICATION , LIKE A RESEARCHER , WHILE FOR SOME THE DIFFERENCE BETWEEN BEING A REPTILE OR A BIRD IS ENOUGH . THIS LEADS TO A MAJOR CONCLUSION . DEPENDING ON HOW COMPLEX OUR CLASSES ARE WE CAN HAVE POINTS WHICH FALL TOGETHER IN A CERTAIN CLASS FOR ONE LEVEL OF CLASSIFICATION WHILE THEY MAY CHANGE CLASSES IF COMPLEXITY INCREASES .

EXAMPLE A PIGEON AND A RABBIT FALL UNDER THE SAME CLASS IF THE DIVISION IS JUST BASED ON WHETHER A CERTAIN ANIMAL LIVES IN WATER OR NOT . BUT THEY FALL IN DIFFERENT CLASSES IF FURTHER DETAILS ARE CONSIDERED .

## WHAT DOES ” K ” SIGNIFY

THE DIFFICULTY /COMPLEXITY OF A PROBLEM LIES IN THE FACT THAT INTO HOW MANY CLASSES ONE HAS TO DISTRIBUTE THE DATA INSTANCES .

IN MACHINE LEARNING THIS IS THE BASIC IDEA BEHIND K MEANS CLUSTERING . THE VALUE OF K SHOWS HOW MANY “CLASSES ” WE ARE CONSIDERING . IN OTHER WORDS THE NUMBER OF CENTROIDS OUR ALGORITHM WILL USE . HENCE A LARGER K IMPLIES MAKING THE CLASSIFICATION MORE STRICTER . THEORETICALLY ONE CAN HAVE AS MANY CLASSES AS THERE ARE DATA POINTS AVAILABLE IN THE DATA SET . THAT WOULD BE BEING SO STRICT THAT EVERY OBJECT BECOMES A CLASS AS WELL AS THE ONLY MEMBER OF THE CLASS!!!

## HOW TO MEASURE “CLOSENESS” : DISTANCE AND ITS TYPES

OBVIOUSLY THINGS THAT ARE SIMILAR OR A RELATED “CLOSELY” TEND TO FALL WITHIN SAME CLASSES .MATHEMATICALLY CLOSENESS REFERS TO THE THE DISTANCE BETWEEN TWO POINTS : DISTANCES ARE OF THE FOLLOWING TYPES :

1. EUCLIDEAN
2. MANHATTAN
3. MINKOWSKI
4. HAMMIMG

## THE BLACK POINTS ARE THE CENTROID POINTS , 3 CENTROIDS RESULT IN CLASSIFICATION INTO 3 GROUPS

IN K MEANS CLUSTERING WE USE THE WELL KNOWN EUCLIDEAN DISTANCE METRIC . LETS SEE THE ALGORITHM:

1. YOU HAVE THE DATA SET (UNLABELED ) PLOTTED .
2. CHOOSE THE VALUE OF K – THE NUMBER OF CLASSES YOU WANT
3. RANDOMLY DRAW K POINTS ON THE PLOT (THESE ARE THE K CENTROIDS ) .
4. FOR EVERY POINT CALCULATE THE K DISTANCES (DISTANCE FROM EACH CENTROID ).
5. ASSOCIATE THE POINT WITH THE CENTROID WITH WHICH IT HAS THE MINIMUM DISTANCE .
6. NOW YOU HAVE DIVIDED THE DATA SET POINTS INTO K SETS , EACH SET HAS POINTS THAT ARE NEAREST TO A PARTICULAR CENTROID .
7. NOW SUPPOSE IN A PARTICULAR SET S ,THERE ARE M POINTS , CALCULATE THE MEAN COORDINATE OF THESE M POINTS .
8. THIS MEAN COORDINATE IS THE NEW CENTROID . DO THIS FOR ALL K SETS . WE GET K UPDATED CENTROID POINTS
9. REPEAT FROM STEP 4 UNTIL IN ANY ITERATION NONE OF THE POINTS CHANGE THEIR SET .

## HOW DO WE DECIDE THE BEST K VALUES FOR OUR DATA SET ?

NOT ALL DATA SETS ARE THE SAME , SOME COULD BE EASILY LINEARLY SEPARABLE , HENCE K=2 WOULD BE ENOUGH . BUT IN MANY CASES THIS IS NOT POSSIBLE . IT ALSO VARIES ACCORDING TO THE COMPLEXITY OF THE PROBLEM . WE USE THE ELBOW METHOD TO DECIDE THE IDEAL VALUE OF K FOR A PARTICULAR DATA SET :

THIS IS TO ENSURE THAT MODEL DOESN’T GET OVER FIT . SURELY ADDING MORE CLASSES WILL MAKE THE MODEL BETTER . BUT IF WE KEEP ON ADDING CLASSES SOON WE WILL BE OVER FITTING AND EVENTUALLY EACH OBJECT IN THE DATA SET WOULD BE A CLASS OF ITS OWN!!

IN THE ELBOW METHOD WE PLOT THE VARIANCE VS THE NUMBER OF CLASSES . THE GRAPH TURNS OUT TO LOOK LIKE “AN ELBOW” . DECREASING SHARPLY INITIALLY AND THEN FORMING AN “L ” SHAPE CURVE .THE SHARPNESS OF THE BEND DEPENDS ON THE PARTICULAR DATASET. AND THIS SHARP “BEND” POINT CORRESPONDS TO THE IDEAL K . WHY ? BECAUSE NOW FOR EVERY FURTHER INTRODUCTION OF NEW CLASS THE CHANGES IN THE CLASSIFICATIONS ARE MINIMAL . TAKE A LOOK AT THE GRAPH BELOW , THINGS WILL GET CLEAR WITH THIS :

## A PLOT OF AVERAGE DISPERSION(VARIANCE ) VS NUMBER OF CLASSES(K)

HAPPY CLASSIFYING!!!