KNN and Probability (interview questions)

An article on one of the most common interview questions: how do you interpret KNN outputs?

KNN has no explicit learning phase. It is a lazy algorithm that simply stores the training data, finds the “k” nearest neighbours of a query point, and performs the classification or regression task almost like a hard-coded instruction; nothing “intelligent” seems to happen. The idea behind the algorithm is simple and straightforward, but it is exactly this simplicity that leads to many possible questions: when one tries to solve complex real-life problems with such a simple algorithm, many border cases must be considered.

Let's try to answer a few.

We know KNN can be used for classification as well as regression. We start with the problems faced during classification.

THE PROBLEM IN CLASSIFICATION

Suppose you have 4 classes and the number of nearest neighbours (k) you chose is 30. For a certain point you got the following results:

1. Class 1 = 10 neighbours
2. Class 2 = 10 neighbours
3. Class 3 = 6 neighbours
4. Class 4 = 4 neighbours

Now what should our test point be classified as? First let's consider what the neighbours from the 3rd and 4th classes tell us. Consider a medical case, say cancer detection, with 2 output classes, C1 and C2.

You used KNN with k=10 and got 6 and 4 points respectively as the neighbours. Just because you have more neighbours of class 1, can you rule out the possibility of the second type of cancer being present?

In many cases you need to look at the “probabilistic” results rather than only the final label chosen by the larger number of neighbours. For the above case of 2 classes, this is how the results would differ:

1. If we use simple majority-vote KNN: output: cancer of class 1.
2. If we use probability scores: output: 60% chance of cancer 1 and 40% chance of cancer 2.
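The difference between the two outputs can be sketched in a few lines of Python. This is only an illustration, assuming the neighbour labels have already been found; the function name `knn_class_probabilities` and the labels "C1"/"C2" are made up for this example.

```python
from collections import Counter

def knn_class_probabilities(neighbour_labels):
    """Turn the labels of the k nearest neighbours into per-class vote fractions."""
    counts = Counter(neighbour_labels)
    k = len(neighbour_labels)
    return {label: count / k for label, count in counts.items()}

# The two-class example from the text: k=10, 6 neighbours of C1 and 4 of C2.
labels = ["C1"] * 6 + ["C2"] * 4
probs = knn_class_probabilities(labels)

# Majority vote keeps only the winner; the probabilities keep the full picture.
majority = max(probs, key=probs.get)
print(probs)     # {'C1': 0.6, 'C2': 0.4}
print(majority)  # C1
```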

Now let's get back to the 4-class problem, where the most neighbours come from class 1 and class 2, with 10 each. Even with probability scores, if we need to give one final output as a solution to a business problem, how can we break the tie? In such cases a lot depends on the domain of the problem, but let's discuss 2 generic ways to break such a tie.

USING WEIGHTED KNN

Between class 1 and class 2 we can select the final output by using weighted KNN. In weighted KNN we give more weight to the neighbours that are closer to the test point. So instead of just counting neighbours, each neighbour's vote is weighted by an inverse function of its distance (for example 1/distance), and the class with the largest total weight wins.
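A minimal sketch of inverse-distance voting, assuming each neighbour is given as a (distance, label) pair. The function name `weighted_vote`, the small `eps` guard against division by zero, and the distances below are all made up for illustration.

```python
from collections import defaultdict

def weighted_vote(neighbours, eps=1e-9):
    """neighbours: list of (distance, label) pairs.

    Each neighbour votes with weight 1/distance, so closer points count more.
    Returns the winning label and the total weight per class.
    """
    scores = defaultdict(float)
    for distance, label in neighbours:
        scores[label] += 1.0 / (distance + eps)
    winner = max(scores, key=scores.get)
    return winner, dict(scores)

# Hypothetical tie by count (two neighbours each), but class 2's neighbours are closer.
neighbours = [(1.0, "class 1"), (2.0, "class 1"), (0.5, "class 2"), (1.0, "class 2")]
winner, scores = weighted_vote(neighbours)
print(winner)  # class 2
```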

INCREASE K BY A CERTAIN VALUE AND RECHECK

We used k=30 in the case provided. To break the tie we can try a slightly different value, say k=32 or 34, and recount the neighbours. The extra neighbours will often, though not always, break the tie.

THE PROBLEM IN REGRESSION

In KNN, regression refers to returning the average or the median of a continuous value associated with the nearest neighbours. The problem here is not related to probability; rather, it is related to the presence of outliers.

An outlier can distort the average, whereas the median is more robust to such issues.
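A quick numerical illustration of this point, using Python's standard `statistics` module. The neighbour values are invented for the example, with the last one playing the role of the outlier.

```python
from statistics import mean, median

# Hypothetical target values of the 5 nearest neighbours; 100.0 is an outlier.
neighbour_values = [10.0, 11.0, 12.0, 13.0, 100.0]

print(mean(neighbour_values))    # 29.2 : pulled far away by the outlier
print(median(neighbour_values))  # 12.0 : barely affected
```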

So in summary, KNN on an n-class problem works well only when one class clearly dominates the remaining n-1 classes among the neighbours. And remember, because there is no such thing as “training” in KNN, there is little one can do except change k (or weight the neighbours) if you find the neighbours spread almost evenly across the n classes.

This article focuses on one of the many problems that one can face during interviews. Other topics and their solutions, like kmeans++, kd-trees and more, will be discussed in subsequent posts.

ML INTERVIEW QUESTIONS -PART 1

In this interview prep series we look at important interview questions asked for data scientist, ML and DL roles.

In each part we discuss a few ML interview questions.

1) What is PDF?

A probability density function (PDF) is a function that describes the probability distribution of a continuous random variable. The area under the PDF over an interval (its integral over that interval) gives the probability of the random variable falling within that interval.

2) What is Confidence Interval?

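A confidence interval is a range of values, computed from a sample, that is likely to contain the true value of an unknown parameter (such as the mean). A 95% confidence level means that if the sampling were repeated many times, about 95% of the intervals built this way would contain the true parameter. Confidence intervals measure the degree of uncertainty in an estimate obtained from a sample.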

3) Can KL divergence be used as a distance measure?

No, not as a true distance metric. It is not symmetric (KL(P||Q) is generally different from KL(Q||P)) and it does not satisfy the triangle inequality.
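The asymmetry is easy to check numerically. The sketch below assumes two discrete distributions given as lists of probabilities; the helper name `kl_divergence` and the example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as probability lists.

    Assumes q has no zero entries where p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

# Swapping the arguments gives a different value, so KL is not symmetric.
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```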

4) What is Log-normal distribution?

In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.

5) What is Spearman Rank Correlation Coefficient?

Spearman Rank Correlation Coefficient is determined by applying Pearson Coefficient on rank encoded random variables.
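As a rough sketch of that definition: rank both variables, then apply the Pearson formula to the ranks. The helpers below are written from scratch for illustration and assume there are no tied values; a library routine would handle ties properly.

```python
def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic but non-linear in x

print(pearson(x, y))   # below 1, because the relationship is not linear
print(spearman(x, y))  # essentially 1, because the ordering is identical
```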

6) Why is “Naive” Bayes naive?

Naive Bayes assumes that the features are conditionally independent of each other given the class, an assumption that rarely holds in practice. The assumption is made to simplify the computation of the conditional probabilities. Naive Bayes is called naive because of this assumption.

7) What is the “crowding problem” in t-SNE?

This happens when the datapoints are distributed in a region on a high-dimensional manifold around i, and we try to model the pairwise distances from i to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map.

8) What are the limitations of PCA?

PCA works best when the variables are strongly correlated. If the relationships between variables are weak, PCA does not reduce the data well; check the correlation matrix to decide.

The principal components are combinations of the original variables, so PCA results are difficult to interpret clearly.

9) Name 2 failure cases of KNN.

When the query point is an outlier (far from all the training points), or when the data is extremely random and carries no information.

10) Name 5 assumptions of linear regression

• Linear relationship
• Multivariate normality
• No or little multicollinearity
• No auto-correlation
• Homoscedasticity

11) Why are log probabilities used in Naive Bayes?

The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
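The underflow is easy to reproduce. The sketch below uses 100 invented per-feature likelihoods of 1e-5 each; the exact numbers are only for illustration.

```python
import math

# Hypothetical per-feature likelihoods for one class.
likelihoods = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is too small for a float.
product = 1.0
for p in likelihoods:
    product *= p
print(product)  # 0.0

# Summing logs instead keeps the score representable and comparable.
log_score = sum(math.log(p) for p in likelihoods)
print(log_score)  # about -1151.29
```

Since the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.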

12) How are numerical features handled in Gaussian NB?

Each numerical feature is assumed to follow a Gaussian distribution within each class. The mean and variance are estimated separately for each class from the data points belonging to it, and the likelihoods are computed from the resulting Gaussian densities.

13) How do you get feature importance in Naive Bayes?

Naive Bayes classifiers don't offer an intrinsic method to evaluate feature importances. Naive Bayes methods work by determining the conditional and unconditional probabilities associated with the features and predicting the class with the highest probability.

14) Differentiate between GD and SGD.

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function.

In SGD only one data point (or, in the mini-batch variant, a small subset) is used per iteration to compute the gradient of the loss function, whereas in GD all the data points are used to compute the gradient at every iteration.
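The difference can be sketched on a toy problem. The example below fits a single weight w in y = w*x by minimising squared error; the data, learning rate, step count and function names are all assumptions chosen for illustration.

```python
import random

# Hypothetical noise-free toy data generated from y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def gd_step(w, lr):
    """Full-batch step: gradient of the mean squared error over ALL points."""
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

def sgd_step(w, lr, rng):
    """Stochastic step: gradient of the squared error at ONE random point."""
    i = rng.randrange(len(xs))
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]
    return w - lr * grad

rng = random.Random(0)
w_gd = w_sgd = 0.0
for _ in range(200):
    w_gd = gd_step(w_gd, lr=0.01)
    w_sgd = sgd_step(w_sgd, lr=0.01, rng=rng)

print(w_gd, w_sgd)  # both end up close to 3.0
```

Each SGD step is cheaper because it touches a single point, at the price of a noisier path towards the minimum.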

15) Do you know the train and run time complexities for an SVM model?

Train time complexity: roughly O(n^2), where n is the number of training points

Run time complexity: O(k*d)

k = number of support vectors, d = dimensionality of the data set

16) Why is RBF kernel SVM compared to kNN?

They are not identical, but they are related. The point is that both kNN and RBF kernel methods are non-parametric, in the same spirit as non-parametric methods for estimating the probability density of your data.

Notice that the two algorithms approach the same problem differently: kernel methods fix the size of the neighbourhood (h) and then count the K points that fall in it, whereas kNN fixes the number of points, K, and then determines the region in space which contains those points.

17) What decides overfitting and underfitting in DT?

The max_depth parameter is the main control on overfitting and underfitting in decision trees: a very deep tree tends to overfit, a very shallow tree tends to underfit.

18) What is Non-negative Matrix Factorization?

Decomposing a non-negative matrix into 2 smaller matrices whose elements are all non-negative (greater than or equal to zero) and whose product approximates the original matrix.

19) What is the Netflix Prize problem?

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.

20) What are word embeddings?

Word embeddings are a type of word representation in a vector space that allows words with similar meaning to have a similar representation.

THE MATH BEHIND SVMs, LAGRANGIAN, KERNEL

In this article we discuss the important mathematical tools used in the formulation of SVMs and also point out the important questions asked in interviews.

SVMs are one of those algorithms which people use as a black box, but interviewers might end up asking about the details of the mathematics involved. And in general it's always better to know the math behind any algorithm.

THE PROBLEM STATEMENT

Classification: the goal of SVM is to find a hyperplane in the d-dimensional vector space (d being the dimensionality of the data set) which separates the data with the “most margin”, or in other words “optimally” (as multiple hyperplanes can separate the points, we want the best of them).

QUESTION-WHAT IS HARD MARGIN SVM?

In hard margin SVM we have zero tolerance for any error. It tries to find the optimal hyperplane assuming all points will be correctly classified at the end.

OPTIMISATION STATEMENT

SO WHAT ARE WE TRYING TO OPTIMIZE IN SVMs?

Let's consider a point whose actual class is y (-1 or +1), and let the predicted value for that point be (w.x+b). Note that there are 4 possible cases:

1. the product y*(w.x+b) is positive because y is “+1” and the predicted distance from the plane is also positive.
2. the product y*(w.x+b) is positive because y is “-1” and the predicted distance from the plane is also negative.
3. the product y*(w.x+b) is negative because y is “+1” and the predicted distance from the plane is negative.
4. the product y*(w.x+b) is negative because y is “-1” and the predicted distance from the plane is positive.

You can see that every wrong classification gives a negative product, and every correct classification gives a positive product.
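The four cases can be checked with a tiny script. The weights, bias and points below are invented purely to produce one example of each case; `functional_margin` is a hypothetical helper name.

```python
def functional_margin(w, b, x, y):
    """Return y * (w.x + b): positive when the point is on its correct side."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical hyperplane and four points, one per case in the list above.
w, b = [1.0, -1.0], 0.0
points = [
    ([2.0, 0.0], 1),   # case 1: y = +1, w.x + b > 0
    ([0.0, 2.0], -1),  # case 2: y = -1, w.x + b < 0
    ([0.0, 2.0], 1),   # case 3: y = +1, w.x + b < 0
    ([2.0, 0.0], -1),  # case 4: y = -1, w.x + b > 0
]

for x, y in points:
    m = functional_margin(w, b, x, y)
    print(x, y, m, "correct" if m > 0 else "wrong")
```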

So we can start by saying that our optimal hyperplane will try to make the product positive for all the points. From now on we will use w/||w|| (and b/||w||) to make the product invariant to the scale of w and b.

Now a slightly tricky statement: “the optimal hyperplane is the one whose minimum value of the above product (which occurs for the closest point) is the largest among all hyperplanes”. In layman's terms, the distance to the closest point (called the margin) should be maximised. So the following is our optimisation problem:

where M is the margin (the distance from the closest point). The hyperplane with the maximum M will be selected.

where gamma is the product function discussed above. Now we try to tweak the above expression. Remember from your class 12 3-D geometry that scaling w and b won't change the final optimal hyperplane, so we rescale w and b to make the minimum product equal to 1 (but why? so that our optimization problem reduces to a single quantity, ||w||!). Now we are left only with

max(w,b) 1/||w|| subject to y*(w.x+b) >= 1 for every point, which is the same as min(w,b) ||w||.

Minimizing ||w|| and minimizing (1/2)||w||^2 are the same problem (the latter makes the further mathematics easier). Hence our optimization problem boils down to:
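The chain of steps in this section can be written out as follows. This is a reconstruction in standard notation rather than the original figures, assuming n training points (x_i, y_i), with gamma_i the scale-invariant product for point i and M the margin.

```latex
% gamma_i: the scale-invariant product for point i
\[
\gamma_i = y_i \left( \frac{w}{\lVert w \rVert} \cdot x_i + \frac{b}{\lVert w \rVert} \right)
\]

% maximise the margin M of the closest point
\[
\max_{w,b} \; M \quad \text{subject to} \quad \gamma_i \ge M, \quad i = 1, \dots, n
\]

% after rescaling w and b so that min_i y_i (w . x_i + b) = 1, M = 1 / ||w||
\[
\max_{w,b} \; \frac{1}{\lVert w \rVert} \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n
\]

% equivalent final form of the hard margin problem
\[
\min_{w,b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n
\]
```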

Let's solve this hard margin classifier.

CONSTRAINED OPTIMIZATION

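QUESTION: HOW ARE CONSTRAINED OPTIMISATION PROBLEMS DIFFERENT FROM SIMPLE OPTIMISATION PROBLEMS?

In a constrained optimisation problem the solution must also satisfy a set of constraints, so we cannot simply set the gradient of the objective to zero. Such problems are solved using the concept of Lagrange multipliers.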

THE DUAL FORM

Substituting in the Lagrange equation above you will have:

which reduces our dual problem to:
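For reference, the standard textbook form of these steps for the hard margin problem is sketched below. It is a reconstruction, not the original figures, and assumes one Lagrange multiplier alpha_i per training point.

```latex
% Lagrangian of: min (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1
\[
L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2
  - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right],
  \qquad \alpha_i \ge 0
\]

% setting the derivatives with respect to w and b to zero
\[
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
\]

% substituting back gives a function of the multipliers only
\[
W(\alpha) = \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
\]

% the dual problem
\[
\max_{\alpha} \; W(\alpha)
\quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
\]
```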

See how this problem now depends only on the Lagrange multipliers. When you use libraries in Python, the above problem is solved using something known as SMO (the sequential minimal optimization algorithm); libraries like libsvm use this algorithm.

We know that in real life we seldom have completely linearly separable data points. So we need a variation which allows a little tolerance for errors. And of course this means some mathematical changes in our optimization problem. This change will also let us control how much error is tolerated, and hence will act as a regularization parameter for our objective function.

Imagine you have a strict dad who says you must score 100 in your maths paper (hard margin). Somehow you are able to make that puppy face and hit his soft corner. He says that he will consider anything above 95 good. So basically what you did was add a tolerance of 5. In the same way, in our soft-margin classifier we introduce a slack variable zeta for every point:

1. Theoretically we can always choose large enough zetas to make our solution work. To prevent that from happening we penalize large zeta values.
2. We must also ensure that the zetas are all non-negative.
3. We add one more parameter C (the regularization parameter) to keep a check on how important the zetas are.
4. Keeping all the above points in mind, we get to the following expression.

PRIMAL FORM

And the corresponding dual form after applying Lagrange multipliers:
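The soft-margin primal and dual are usually written as below. This is a reconstruction in standard notation rather than the original figures, with zeta_i the slack variable of point i, C the regularization parameter and alpha_i the Lagrange multipliers.

```latex
% primal form: penalised slack variables
\[
\min_{w, b, \zeta} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \zeta_i
\quad \text{subject to} \quad
y_i (w \cdot x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \dots, n
\]

% dual form: same objective as the hard margin dual, but alpha_i is capped at C
\[
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
\]
```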

The parameter C sets how heavily the zetas are penalised. The tolerance for non-zero zetas is inversely related to the value of C: a small C allows many margin violations, while an infinitely large positive C means a hard margin classifier. Remember this and it will help in interviews.

SOLVING NON-LINEAR PROBLEMS with SVMs

The kernel trick lets us implicitly project the points into a higher dimension, so that points which were seemingly inseparable by a hyperplane in the lower dimension become separable in the higher dimension:

Mathematically, in the dual form we define a kernel function K which calculates the dot product of xi and xj in the higher dimension without computing the projection explicitly. Hence our dual function becomes:
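In standard notation (a reconstruction, assuming the soft-margin dual with multipliers alpha_i and regularization parameter C), the kernel simply takes the place of the dot product:

```latex
% the dot product x_i . x_j is replaced by the kernel K(x_i, x_j)
\[
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
\]
```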

There are different varieties of kernel functions available:

1. the linear kernel
2. the polynomial kernel
3. the RBF kernel
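The three kernels can be sketched in plain Python as below. The parameter names and defaults (`degree`, `coef0`, `gamma`) follow common usage but are assumptions here, not values taken from the article.

```python
import math

def dot(x, z):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return dot(x, z)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))      # 11.0
print(polynomial_kernel(x, z))  # 144.0
print(rbf_kernel(x, x))         # 1.0 : a point is maximally similar to itself
print(rbf_kernel(x, z))         # between 0 and 1, shrinking with distance
```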

1. CAN SOLVE NON LINEAR PROBLEMS .
2. RUN TIME DEPENDS SOLELY ON SUPPORT VECTORS AS ALPHA VALUES FOR OTHER POINTS ARE ZERO
3. DIFFERENT KERNEL FUNCTIONS MAKE IT VERSATILE.
4. WORKS WELL IF DIMENSIONALITY IS MORE THAN NUMBER OF SAMPLES.
5. PRONE TO OUTLIERS.

1. TUNING AND SELECTING KERNEL AND HYPERPARAMETERS.
2. HIGH TRAINING TIME, HENCE SCALABILITY ISSUES.
3. PERFORMANCE DEGRADES IF THE CLASSES OVERLAP HEAVILY (NOISE).

A NOTE FROM THE DOCUMENTATION PAGE OF SCIKIT-LEARN (REGARDING OUTPUT AND FEATURE IMPORTANCES):

DESCRIBING POLICIES AND VALUE FUNCTIONS OF A MARKOV DECISION PROCESS

The last article (Finite Markov Decision Process - Part 1) focused mainly on describing what we mean by returns, an environment, and an environment state following the Markov property. So what's next? Now we get into the mathematics of policies and value functions. We see how to define these mathematically and then look at the associated Bellman equations. Later we look at the optimal policies and value functions and how they ultimately lead us to the final function we are looking for. Then we see what problems such solutions present and how we can approximate the function to meet our demands. If you are not familiar with the terminology, I recommend referring to the previous two articles.

POLICIES AND VALUE FUNCTIONS

The very purpose of any reinforcement learning problem is to calculate value functions related to a state or a state-action pair, which are a measure of how “good” or “bad” it is for an agent to start from that particular state (or to take that action in that state). Mathematically this good/bad has to do with what future rewards one can expect starting from a given state. Value functions are defined with respect to what are called policies. We defined a policy in the first article on reinforcement learning as a mapping from states to probabilities of selecting each possible action. Below is a formal definition of a policy, taken from Reinforcement Learning: An Introduction by Sutton and Barto:

THE MATHEMATICS OF VALUE FUNCTIONS

Suppose the agent is in a state St. The value function for the state is the expected return, given that the agent follows a certain policy. Mathematically the value function for a Markov decision process is:

E refers to the expectation value, Gt is the return function.

V pi(s) is known as the state value function for a policy pi. We define yet another type of value function that incorporates the action taken from a certain state and thereafter following a particular policy. This type of value function, known as the action value function, is defined as:
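In the notation of Sutton and Barto, the two value functions described here are usually written as follows (a reconstruction in standard notation, not the original figures):

```latex
% state value function: expected return from state s when following policy pi
\[
v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]
\]

% action value function: expected return from state s, taking action a,
% and following policy pi afterwards
\[
q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]
\]
```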

PROPERTY OF VALUE FUNCTIONS

We introduce an important property of value functions which will be used throughout the reinforcement learning articles and in dynamic programming: their recursive nature. For any state St and the states that follow it under a policy (pi), the following condition holds as a result of this recursive nature:
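In standard notation the recursion reads as below. This is a reconstruction rather than the original figure, assuming p(s', r | s, a) denotes the probability of moving to state s' with reward r after taking action a in state s, and gamma is the discount parameter.

```latex
% Bellman equation for the state value function
\[
v_\pi(s)
  = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \, G_{t+1} \mid S_t = s \right]
  = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
    \left[ r + \gamma \, v_\pi(s') \right]
\]
```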

The above equation is the Bellman equation we mentioned at the beginning of this article. Notice how we started with the state value function, expressed the return as a function of the return at the next step, and then just used the fact we stated in the previous article. The Bellman equation is a sum over all actions available from s, weighted by the policy's probability of choosing each action, of a sum over all possible next states and rewards, weighted by their transition probabilities. We now use this recursive Bellman equation to figure out the optimal policies and value functions.

OPTIMAL POLICIES AND VALUE FUNCTIONS

Now, of all the policies and value functions, we want the best ones! The optimal policies and value functions are just the ones that have the maximum values and hence give the maximum return in the long run. This is exactly what an agent desires. Now we look at the two optimal equations. Practically they are no different from the previous ones; we have just added a “max” function to them:

The optimal action value function. This is what the agent ultimately has to find out.
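The two optimal equations are usually written as follows (a reconstruction in standard notation, with p(s', r | s, a) the transition probability and gamma the discount parameter):

```latex
% optimal state value function
\[
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_*(s') \right]
\]

% optimal action value function
\[
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)
  \left[ r + \gamma \max_{a'} q_*(s', a') \right]
\]
```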

The above Bellman equation is the one we were looking for!!! But...

This comes at a cost! Theoretically we have the answer, but in reality this equation is very difficult to solve. Memory is a big constraint in computation, because the number of possible states shoots up exponentially as the dimensions increase. Remember that dynamic programming with look-up tables is the way, but even seemingly simple games like chess become huge in terms of computation and memory.

Hence we have to settle for approximations. Moreover, Markov decision making becomes cumbersome once dimensionality increases. In the next article we will explore and try to solve a problem using the Markov property. Then we will head to Monte Carlo methods of solving such problems.

HAPPY LEARNING!!! MAKE SOME MEMORIES , WE MIGHT NEED THEM!!

DESCRIBING THE MARKOV DECISION PROCESS IN MACHINE LEARNING

In this post we discuss a certain type of reinforcement learning problem. We define a Markov process and then we get to the maths. But as always, first we need to understand a few definitions. Then we try to relate them to a real-life problem. We have already seen how an agent interacts with its environment and tries to maximise its long-term gains. For any problem we represent:

1. the set of all possible states by S and the state at time t by St
2. the set of all possible actions by A and the action at time t by At
3. the policy mapping a state s to an action a by pi(a|s)

RETURNS

We have seen that the goal of an agent is to maximise the returns. Returns refer to the long-term rewards starting from a particular state. So how do we represent them mathematically? As stated, the return must be some function of present and future rewards.

So what is a Markov process? Well, we need to consider two things here. First, that the agent knows exactly what the environment is. Second, if at any time t the agent is in a state S, the expected returns from there onwards depend only on that state and the actions taken from there. That is, how you ended up in that state won't affect the future returns. This is what it looks like:

THE DISCOUNT PARAMETER

Notice the parameter gamma and its successive powers in the expression? Gamma is referred to as the “discount” parameter. It lets us express the diminishing importance of future rewards. A small gamma tends towards a greedy approach, where the return function focuses on the more immediate rewards. If gamma = 0, the agent is said to be “myopic”, that is, focused on just the immediate reward. Secondly, gamma helps the return function converge when the total number of allowed steps/actions is not predefined. In such cases, in the absence of gamma (that is, with gamma = 1), the return would turn into an infinite sum that may not converge.
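The effect of gamma can be seen with a short script. It assumes the usual discounted return, G_t = R_(t+1) + gamma*R_(t+2) + gamma^2*R_(t+3) + ..., over a finite list of rewards; the function name and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma raised to its delay.

    rewards[0] is the immediate reward, rewards[1] the next one, and so on.
    Working backwards avoids computing the powers of gamma explicitly.
    """
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Hypothetical episode: a reward of 1 at each of four steps.
rewards = [1.0, 1.0, 1.0, 1.0]

print(discounted_return(rewards, gamma=0.0))  # 1.0   : myopic, only the first reward counts
print(discounted_return(rewards, gamma=0.5))  # 1.875 : later rewards count less and less
print(discounted_return(rewards, gamma=1.0))  # 4.0   : no discounting, a plain sum
```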

THE MARKOV PROPERTY

The Markov property is a special case of a more general definition of an environment. Let's discuss what we mean by a general environment. Suppose the agent is currently in a state St.

Now, in a more general sense, what happens next would depend on how you actually made it to that particular state. That is, what happens next is not only a function of the present state; rather it depends on the whole cumulative past.

A GENERALISED REAL LIFE ENVIRONMENT

Suppose you qualified the first level of an MCQ-based exam. Let's say the preparation for the second round is the present state St. Now whether you excel in the second round depends on how you passed the first round. If half of the questions were guesswork and you were lucky enough to get past the cut-off score, your chances of qualifying the next round are low. On the other hand, if you solved every question, your chances are high.

Let's represent the above statements mathematically. Suppose the history of reaching a state St is represented by ((S0, A0, R1), (S1, A1, R2), ..., (S(t-1), A(t-1), R(t))). For a general environment the future returns would depend on this history (so in the above exam case, on how many actions were guesswork and how many were solving and answering).

Now let's discuss a special case of the above general environment. Suppose you are playing a game of chess: for a particular board position, the future does not depend on how you reached it; from there only the next moves matter. Another example: suppose you have to reach Mumbai from Lucknow via Delhi. Once you have reached Delhi airport, optimising the time taken to reach Mumbai won't depend on how you reached Delhi. Using the above intuition we define below what is known as a Markov process.

THE MATH

We define the probability of transitioning at step t from a state (s) to a state (s') by taking an action (a) and getting a reward (r):

Notice how the probability depends only on the present state and the action taken from it.
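In the notation of Sutton and Barto this probability is usually written as below (a reconstruction in standard notation, not the original figure). The second line states the Markov property: conditioning on the whole history gives the same probability as conditioning on the latest state and action alone.

```latex
% dynamics of the environment
\[
p(s', r \mid s, a) = \Pr \left\{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \right\}
\]

% Markov property: the earlier history adds no information
\[
\Pr \left\{ S_t = s', R_t = r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1} \right\}
  = \Pr \left\{ S_t = s', R_t = r \mid S_{t-1}, A_{t-1} \right\}
\]
```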

So now we know what it means for a state to follow the Markov property. Next we discuss policy functions and value functions with respect to Markov decision processes. We will see what drives a reinforcement learning agent and we will be introduced to one of the most vital equations in reinforcement learning, the Bellman equation.

HAPPY LEARNING!!!