# FINITE MARKOV DECISION PROCESS - PART 1

## DESCRIBING THE MARKOV DECISION PROCESS IN MACHINE LEARNING

IN THIS POST WE DISCUSS A CERTAIN TYPE OF REINFORCEMENT LEARNING PROBLEM. WE DEFINE A MARKOV PROCESS AND THEN GET TO THE MATHS. BUT, AS ALWAYS, WE FIRST NEED TO UNDERSTAND A FEW DEFINITIONS, WHICH WE THEN TRY TO RELATE TO A REAL-LIFE PROBLEM. WE HAVE ALREADY SEEN HOW AN AGENT INTERACTS WITH ITS ENVIRONMENT AND TRIES TO MAXIMISE ITS LONG-TERM GAINS. FOR ANY PROBLEM WE REPRESENT:

1. THE SET OF ALL POSSIBLE STATES BY S AND THE STATE AT TIME t BY St
2. THE SET OF ALL POSSIBLE ACTIONS BY A AND THE ACTION AT TIME t BY At
3. THE POLICY, WHICH GIVES THE PROBABILITY OF TAKING AN ACTION a IN A STATE s, BY pi(a|s)
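TO MAKE THE NOTATION CONCRETE, HERE IS A MINIMAL PYTHON SKETCH OF A POLICY pi(a|s) STORED AS A TABLE OF PROBABILITIES. THE STATE NAMES, ACTION NAMES, PROBABILITIES AND THE HELPERS `is_valid_policy` AND `sample_action` ARE ALL ASSUMPTIONS MADE UP FOR ILLUSTRATION; THEY DO NOT COME FROM ANY LIBRARY.

```python
import random

# Hypothetical toy problem: every name and number here is illustrative only.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# POLICY[s][a] is the probability of choosing action a in state s, i.e. pi(a|s).
POLICY = {
    "s0": {"left": 0.75, "right": 0.25},
    "s1": {"left": 0.0, "right": 1.0},
}


def is_valid_policy(policy, states, actions, tol=1e-9):
    """Check that pi(.|s) is a probability distribution for every state s."""
    for s in states:
        probs = [policy[s][a] for a in actions]
        if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > tol:
            return False
    return True


def sample_action(policy, state, rng):
    """Draw the action A_t from pi(.|S_t = state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(is_valid_policy(POLICY, STATES, ACTIONS))
print(sample_action(POLICY, "s1", rng))  # "right" is the only action with non-zero probability
```

A DETERMINISTIC POLICY IS JUST THE SPECIAL CASE WHERE ONE ACTION IN EACH STATE HAS PROBABILITY 1, AS FOR `s1` HERE.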

## RETURNS

WE HAVE SEEN THAT THE GOAL OF AN AGENT IS TO MAXIMISE ITS RETURN. THE RETURN REFERS TO THE LONG-TERM REWARD COLLECTED STARTING FROM A PARTICULAR STATE. SO HOW DO WE REPRESENT IT MATHEMATICALLY? AS STATED, IT MUST BE SOME FUNCTION OF THE PRESENT AND FUTURE REWARDS.

SO WHAT IS A MARKOV PROCESS? WE NEED TO CONSIDER TWO THINGS HERE. FIRST, THE AGENT KNOWS EXACTLY WHAT THE ENVIRONMENT IS. SECOND, IF AT ANY TIME t THE AGENT IS IN A STATE S, THE EXPECTED RETURN FROM THERE ONWARDS DEPENDS ONLY ON THAT STATE AND THE ACTIONS TAKEN FROM THERE. THAT IS, HOW YOU ENDED UP IN THAT STATE WON’T AFFECT THE FUTURE RETURN. WRITING R(t+1) FOR THE REWARD RECEIVED AFTER ACTING AT TIME t, THIS IS WHAT THE RETURN LOOKS LIKE:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

## THE DISCOUNT PARAMETER

NOTICE THE PARAMETER GAMMA AND ITS SUCCESSIVE POWERS IN THE EXPRESSION? GAMMA IS REFERRED TO AS THE “DISCOUNT” PARAMETER, AND IT TAKES A VALUE BETWEEN 0 AND 1. IT ALLOWS US TO EXPRESS THE DIMINISHING IMPORTANCE OF FUTURE REWARDS. A SMALL GAMMA TENDS TOWARDS A GREEDY APPROACH WHERE THE RETURN FOCUSES ON THE MORE IMMEDIATE REWARDS. IF GAMMA = 0, THE AGENT IS SAID TO BE “MYOPIC”, THAT IS, FOCUSED ON JUST THE IMMEDIATE REWARD. SECONDLY, GAMMA HELPS THE RETURN CONVERGE WHEN THE TOTAL NUMBER OF ALLOWED STEPS/ACTIONS IS NOT PREDEFINED. IN SUCH CASES, WITHOUT DISCOUNTING (GAMMA = 1), THE RETURN WOULD BE AN INFINITE SUM THAT NEED NOT CONVERGE, WHEREAS WITH GAMMA < 1 AND BOUNDED REWARDS IT STAYS FINITE.
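THE EFFECT OF GAMMA IS EASY TO SEE NUMERICALLY. BELOW IS A MINIMAL PYTHON SKETCH THAT COMPUTES A DISCOUNTED RETURN FOR A FINITE LIST OF REWARDS. THE FUNCTION NAME `discounted_return` AND THE REWARD VALUES ARE ASSUMPTIONS CHOSEN ONLY FOR ILLUSTRATION.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite list of rewards."""
    g = 0.0
    # Work backwards, using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 3.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}

print(discounted_return(rewards, 0.0))  # myopic: only the first reward counts
print(discounted_return(rewards, 0.5))  # 1 + 0.5*2 + 0.25*3
print(discounted_return(rewards, 1.0))  # undiscounted: plain sum of the rewards

# With a constant reward of 1 and gamma < 1, the return approaches 1 / (1 - gamma),
# which is why discounting keeps the sum finite even without a step limit.
long_run = discounted_return([1.0] * 1000, 0.9)
print(long_run, 1.0 / (1.0 - 0.9))
```

WITH GAMMA = 1 THE SAME CONSTANT REWARD WOULD KEEP GROWING WITH THE NUMBER OF STEPS, WHICH IS THE CONVERGENCE PROBLEM DESCRIBED ABOVE.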

## THE MARKOV PROPERTY

THE MARKOV PROPERTY IS A SPECIAL CASE OF A MORE GENERAL DEFINITION OF AN ENVIRONMENT. LET’S DISCUSS WHAT WE MEAN BY A GENERAL ENVIRONMENT. SUPPOSE THE AGENT IS CURRENTLY IN A STATE St.

IN THE MOST GENERAL SENSE, WHAT HAPPENS NEXT DEPENDS ON HOW YOU ACTUALLY REACHED THAT PARTICULAR STATE. THAT IS, WHAT HAPPENS NEXT IS NOT ONLY A FUNCTION OF THE PRESENT STATE BUT DEPENDS ON THE WHOLE CUMULATIVE PAST.

A GENERALISED REAL LIFE ENVIRONMENT

SUPPOSE YOU QUALIFIED IN THE FIRST LEVEL OF AN MCQ-BASED EXAM. LET’S SAY THE PREPARATION FOR THE SECOND ROUND IS THE PRESENT STATE St. WHETHER YOU EXCEL IN THE SECOND ROUND DEPENDS ON HOW YOU PASSED THE FIRST ROUND. IF HALF OF YOUR ANSWERS WERE GUESSWORK AND YOU WERE LUCKY ENOUGH TO GET PAST THE CUT-OFF SCORE, YOUR CHANCES OF QUALIFYING IN THE NEXT ROUND ARE LOW. ON THE OTHER HAND, IF YOU ACTUALLY SOLVED EVERY QUESTION, YOUR CHANCES ARE HIGH.

LET’S REPRESENT THE ABOVE STATEMENTS MATHEMATICALLY. SUPPOSE THE HISTORY OF REACHING A STATE St IS REPRESENTED BY ((S0, A0, R1), (S1, A1, R2), ..., (S(t-1), A(t-1), R(t))). FOR A GENERAL ENVIRONMENT THE FUTURE RETURNS DEPEND ON THIS HISTORY. (IN THE EXAM CASE ABOVE, THE HISTORY RECORDS HOW MANY ACTIONS WERE GUESSWORK AND HOW MANY WERE SOLVING AND ANSWERING.)

NOW LET’S DISCUSS A SPECIAL CASE OF THE GENERAL ENVIRONMENT ABOVE. SUPPOSE YOU ARE PLAYING A GAME OF CHESS. FOR A PARTICULAR BOARD POSITION, THE FUTURE DOES NOT DEPEND ON HOW YOU REACHED IT; FROM THERE, ONLY THE NEXT MOVES MATTER. ANOTHER EXAMPLE: SUPPOSE YOU HAVE TO REACH MUMBAI FROM LUCKNOW VIA DELHI. ONCE YOU HAVE REACHED DELHI AIRPORT, OPTIMISING THE TIME TAKEN TO REACH MUMBAI WON’T DEPEND ON HOW YOU REACHED DELHI. USING THIS INTUITION, WE DEFINE BELOW WHAT IS KNOWN AS A MARKOV PROCESS.
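THE CONTRAST BETWEEN THE GENERAL CASE AND THIS SPECIAL CASE CAN BE WRITTEN AS A SINGLE EQUALITY. THIS IS THE USUAL TEXTBOOK STATEMENT OF THE MARKOV PROPERTY, USING THE SAME HISTORY NOTATION AS ABOVE; THE SYMBOLS s’ AND r FOR THE NEXT STATE AND THE REWARD ARE INTRODUCED HERE ONLY FOR ILLUSTRATION.

$$\Pr\{S_t = s', R_t = r \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}\} = \Pr\{S_t = s', R_t = r \mid S_{t-1}, A_{t-1}\}$$

THE LEFT SIDE CONDITIONS ON THE WHOLE HISTORY, THE RIGHT SIDE ONLY ON THE MOST RECENT STATE AND ACTION. IN THE EXAM EXAMPLE THE TWO SIDES DIFFER; IN THE CHESS AND DELHI EXAMPLES THEY ARE EQUAL.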

## THE MATH

WE DEFINE THE PROBABILITY, AT STEP t, OF TRANSITIONING TO A STATE s’ FROM A STATE s BY TAKING AN ACTION a AND GETTING A REWARD r:

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

#### NOTICE HOW THE PROBABILITY DEPENDS ONLY ON THE PRESENT STATE AND THE NEXT ACTION
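AS A ROUGH ILLUSTRATION, THE PROBABILITY p(s’, r | s, a) CAN BE STORED AS A LOOKUP TABLE KEYED ONLY BY THE PRESENT STATE AND THE ACTION. THE PYTHON SKETCH BELOW REUSES THE DELHI TO MUMBAI EXAMPLE; THE TABLE `DYNAMICS`, THE HELPERS `is_valid_dynamics` AND `step`, AND ALL THE NUMBERS (REWARDS AS NEGATIVE TRAVEL HOURS) ARE ASSUMPTIONS MADE UP FOR THIS SKETCH.

```python
import random

# Hypothetical dynamics: DYNAMICS[(s, a)] lists (next_state, reward, probability),
# i.e. the values of p(s', r | s, a). All numbers are made up for illustration.
DYNAMICS = {
    ("delhi", "fly"): [("mumbai", -2.0, 0.9), ("delhi", -1.0, 0.1)],
    ("delhi", "train"): [("mumbai", -17.0, 1.0)],
}


def is_valid_dynamics(dynamics, tol=1e-9):
    """For every (s, a), the probabilities over (s', r) must sum to 1."""
    for outcomes in dynamics.values():
        total = sum(p for _, _, p in outcomes)
        if abs(total - 1.0) > tol:
            return False
    return True


def step(dynamics, state, action, rng):
    """Sample (next_state, reward) from p(., . | state, action).

    There is no history argument: the present state and the action
    are all the function needs, which is the Markov property.
    """
    outcomes = dynamics[(state, action)]
    weights = [p for _, _, p in outcomes]
    next_state, reward, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
print(is_valid_dynamics(DYNAMICS))
print(step(DYNAMICS, "delhi", "train", rng))  # the train has a single possible outcome
print(step(DYNAMICS, "delhi", "fly", rng))    # usually Mumbai, occasionally still in Delhi
```

A NON-MARKOV ENVIRONMENT WOULD NEED `step` TO TAKE THE WHOLE HISTORY AS AN EXTRA INPUT, AS IN THE EXAM EXAMPLE.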

SO NOW WE KNOW WHAT IT MEANS FOR A STATE TO FOLLOW THE MARKOV PROPERTY. NEXT WE DISCUSS POLICY FUNCTIONS AND VALUE FUNCTIONS WITH RESPECT TO MARKOV DECISION PROCESSES. WE WILL SEE WHAT DRIVES A REINFORCEMENT LEARNING AGENT AND WE WILL BE INTRODUCED TO ONE OF THE MOST VITAL EQUATIONS IN REINFORCEMENT LEARNING, THE BELLMAN EQUATION.

HAPPY LEARNING!!!