# FINITE MARKOV DECISION PROCESS - PART 2

## DESCRIBING POLICIES AND VALUE FUNCTIONS OF A MARKOV DECISION PROCESS

THE LAST ARTICLE (FINITE MARKOV DECISION PROCESS - PART 1) FOCUSED MAINLY ON DESCRIBING WHAT WE MEAN BY RETURNS, AN ENVIRONMENT, AND AN ENVIRONMENT STATE THAT FOLLOWS THE MARKOV PROPERTY. SO WHAT'S NEXT? NOW WE GET INTO THE MATHEMATICS OF POLICIES AND VALUE FUNCTIONS. WE SEE HOW TO DEFINE THESE MATHEMATICALLY AND THEN LOOK AT THE ASSOCIATED BELLMAN EQUATIONS. LATER WE SEE THE OPTIMAL POLICIES AND VALUE FUNCTIONS AND HOW THEY ULTIMATELY LEAD US TO THE FINAL FUNCTION WE ARE LOOKING FOR. THEN WE SEE WHAT PROBLEMS SUCH SOLUTIONS PRESENT AND WHY WE HAVE TO APPROXIMATE THE FUNCTION TO MEET OUR DEMANDS. IF YOU ARE NOT FAMILIAR WITH THE TERMINOLOGY, I RECOMMEND REFERRING TO THE PREVIOUS TWO ARTICLES.

## POLICIES AND VALUE FUNCTIONS

A CENTRAL TASK IN ALMOST ANY REINFORCEMENT LEARNING PROBLEM IS TO ESTIMATE VALUE FUNCTIONS OF A STATE OR A STATE-ACTION PAIR. A VALUE FUNCTION IS A MEASURE OF HOW "GOOD" OR "BAD" IT IS FOR AN AGENT TO BE IN A PARTICULAR STATE (OR TO TAKE A PARTICULAR ACTION IN THAT STATE). MATHEMATICALLY, THIS GOOD/BAD HAS TO DO WITH WHAT FUTURE REWARDS ONE CAN EXPECT STARTING FROM A GIVEN STATE. SINCE FUTURE REWARDS DEPEND ON THE ACTIONS THE AGENT TAKES, VALUE FUNCTIONS ARE DEFINED WITH RESPECT TO PARTICULAR WAYS OF ACTING, CALLED POLICIES. WE DEFINED A POLICY IN THE FIRST ARTICLE ON REINFORCEMENT LEARNING AS A MAPPING FROM STATES TO PROBABILITIES OF SELECTING EACH POSSIBLE ACTION. BELOW IS THE FORMAL DEFINITION OF A POLICY, FOLLOWING REINFORCEMENT LEARNING: AN INTRODUCTION BY SUTTON AND BARTO:

IF THE AGENT IS FOLLOWING POLICY $\pi$ AT TIME $t$, THEN $\pi(a \mid s)$ IS THE PROBABILITY THAT $A_t = a$ IF $S_t = s$.
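TO MAKE THE IDEA OF A POLICY AS A MAPPING FROM STATES TO ACTION PROBABILITIES CONCRETE, HERE IS A MINIMAL SKETCH. THE STATE NAMES, ACTION NAMES, PROBABILITIES, AND THE HELPERS `is_valid_policy` AND `sample_action` ARE ALL MADE UP FOR ILLUSTRATION; THEY DO NOT COME FROM THE BOOK OR FROM THE PREVIOUS ARTICLES.

```python
import random

# Hypothetical two-state, two-action policy: policy[s][a] plays the role of pi(a|s).
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}


def is_valid_policy(pi, tol=1e-9):
    """Check that every state's action probabilities are >= 0 and sum to 1."""
    return all(
        all(p >= 0 for p in dist.values()) and abs(sum(dist.values()) - 1.0) <= tol
        for dist in pi.values()
    )


def sample_action(pi, state, rng=random):
    """Draw an action a with probability pi(a|state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

A DETERMINISTIC POLICY IS JUST THE SPECIAL CASE WHERE ONE ACTION IN EACH STATE HAS PROBABILITY 1.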

## THE MATHEMATICS OF VALUE FUNCTIONS

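SUPPOSE THE AGENT IS IN A STATE $S_t = s$. THE VALUE FUNCTION FOR THE STATE IS THE EXPECTED RETURN, GIVEN THAT THE AGENT FOLLOWS A CERTAIN POLICY $\pi$ FROM THAT STATE ONWARD. MATHEMATICALLY, THE VALUE FUNCTION FOR A MARKOV DECISION PROCESS IS:

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$$

HERE $\mathbb{E}_\pi$ REFERS TO THE EXPECTED VALUE WHEN THE AGENT FOLLOWS $\pi$, $G_t$ IS THE RETURN, AND $\gamma$ IS THE DISCOUNT FACTOR.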
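SINCE THE STATE VALUE IS AN EXPECTED RETURN, ONE WAY TO GET A FEEL FOR IT IS TO COMPUTE THE RETURN OF A FEW SAMPLED EPISODES AND AVERAGE THEM. THE SKETCH BELOW ASSUMES A DISCOUNTED RETURN WITH DISCOUNT FACTOR `gamma`; THE REWARD SEQUENCES AND THE FUNCTION NAMES ARE HYPOTHETICAL, AND AVERAGING SAMPLES ONLY APPROXIMATES THE TRUE EXPECTATION.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    `rewards` lists R_{t+1}, R_{t+2}, ... in order. Working backwards uses
    the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_state_value(episodes, gamma):
    """Average the returns of episodes that all start in the same state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Made-up reward sequences observed after starting from the same state.
episodes = [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]]
estimate = estimate_state_value(episodes, gamma=0.5)
```

WITH THESE TWO EPISODES AND `gamma=0.5` THE RETURNS ARE 1.5 AND 0.25, SO THE ESTIMATE IS THEIR AVERAGE, 0.875.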

$v_\pi(s)$ IS KNOWN AS THE STATE-VALUE FUNCTION FOR POLICY $\pi$. WE DEFINE ANOTHER TYPE OF VALUE FUNCTION THAT ALSO INCORPORATES THE ACTION TAKEN FROM A CERTAIN STATE BEFORE FOLLOWING THE POLICY. THIS TYPE OF VALUE FUNCTION, KNOWN AS THE ACTION-VALUE FUNCTION, IS DEFINED AS:

$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
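THE TWO VALUE FUNCTIONS ARE CLOSELY LINKED: AVERAGING THE ACTION VALUES OF A STATE, WEIGHTED BY HOW LIKELY THE POLICY IS TO PICK EACH ACTION, GIVES BACK THE STATE VALUE. THE SKETCH BELOW ILLUSTRATES THIS WITH MADE-UP NUMBERS; THE TABLES AND THE HELPER `state_value_from_q` ARE ASSUMPTIONS FOR ILLUSTRATION ONLY.

```python
# Hypothetical numbers, chosen only for illustration.
policy = {"s0": {"left": 0.2, "right": 0.8}}        # pi(a|s)
q_table = {("s0", "left"): 1.0, ("s0", "right"): 2.0}  # q_pi(s, a)


def state_value_from_q(pi, q, state):
    """v_pi(s) = sum over a of pi(a|s) * q_pi(s, a)."""
    return sum(prob * q[(state, action)] for action, prob in pi[state].items())


v_s0 = state_value_from_q(policy, q_table, "s0")  # 0.2 * 1.0 + 0.8 * 2.0
```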

## PROPERTY OF VALUE FUNCTIONS

WE NOW INTRODUCE AN IMPORTANT FEATURE OF VALUE FUNCTIONS THAT WILL BE USED THROUGHOUT THE REINFORCEMENT LEARNING ARTICLES AND IN DYNAMIC PROGRAMMING: THEIR RECURSIVE NATURE. FOR ANY STATE $s$ AND ITS POSSIBLE SUCCESSOR STATES $s'$ REACHED BY FOLLOWING A POLICY $\pi$, THE FOLLOWING CONDITION HOLDS:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ &= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] \end{aligned}$$

THE ABOVE EQUATION IS THE BELLMAN EQUATION WE MENTIONED AT THE BEGINNING OF THIS ARTICLE. NOTICE HOW WE STARTED WITH THE STATE-VALUE FUNCTION, WROTE THE RETURN $G_t$ IN TERMS OF THE RETURN AT THE NEXT STEP, $G_{t+1}$, AND THEN USED THE MARKOV PROPERTY FROM THE PREVIOUS ARTICLE, SO THAT THE TRANSITION PROBABILITIES $p(s', r \mid s, a)$ DEPEND ONLY ON THE CURRENT STATE AND ACTION. THE BELLMAN EQUATION IS A WEIGHTED AVERAGE, NOT A PRODUCT OF POLICIES: EACH ACTION $a$ IS WEIGHTED BY THE POLICY PROBABILITY $\pi(a \mid s)$, EACH NEXT STATE $s'$ AND REWARD $r$ IS WEIGHTED BY ITS TRANSITION PROBABILITY, AND WHAT WE AVERAGE IS THE IMMEDIATE REWARD PLUS THE DISCOUNTED VALUE OF THE NEXT STATE. WE NOW USE THIS RECURSIVE BELLMAN EQUATION TO FIGURE OUT THE OPTIMAL POLICIES AND VALUE FUNCTIONS.
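THE RECURSIVE FORM OF THE BELLMAN EQUATION CAN BE USED DIRECTLY AS AN UPDATE RULE, A STANDARD DYNAMIC PROGRAMMING TECHNIQUE KNOWN AS ITERATIVE POLICY EVALUATION. BELOW IS A MINIMAL SKETCH ON A TOY PROBLEM. THE TWO STATES, TWO ACTIONS, REWARDS, UNIFORM RANDOM POLICY, AND `GAMMA = 0.9` ARE ALL HYPOTHETICAL, AND THE DYNAMICS ARE ASSUMED TO BE FULLY KNOWN.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics: DYNAMICS[(s, a)] = [(probability, next_state, reward), ...]
DYNAMICS = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}

# Uniform random policy: pi(a|s) = 0.5 for both actions in every state.
POLICY = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for state s under POLICY."""
    return sum(
        POLICY[s][a]
        * sum(p * (r + GAMMA * v[s2]) for p, s2, r in DYNAMICS[(s, a)])
        for a in ACTIONS
    )


def policy_evaluation(theta=1e-10):
    """Sweep over the states until no value changes by more than theta."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = bellman_backup(v, s)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v  # in-place update, reused later in the same sweep
        if delta < theta:
            return v


v_pi = policy_evaluation()
```

FOR THIS TOY PROBLEM THE TWO BELLMAN EQUATIONS CAN ALSO BE SOLVED BY HAND, GIVING $v_\pi(s0) = 7.25$ AND $v_\pi(s1) = 7.75$, WHICH IS WHAT THE SWEEPS CONVERGE TO.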

## OPTIMAL POLICIES AND VALUE FUNCTIONS

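NOW, OF ALL THE POLICIES AND VALUE FUNCTIONS, WE WANT THE BEST ONES! THE OPTIMAL POLICIES AND VALUE FUNCTIONS ARE THE ONES THAT HAVE THE MAXIMUM VALUE IN EVERY STATE AND HENCE GIVE THE MAXIMUM EXPECTED RETURN IN THE LONG RUN. THIS IS EXACTLY WHAT AN AGENT DESIRES. NOW WE LOOK AT THE TWO OPTIMAL EQUATIONS. THEY LOOK MUCH LIKE THE PREVIOUS ONES, EXCEPT THAT A "MAX" OVER ACTIONS REPLACES THE AVERAGE OVER THE POLICY'S ACTION PROBABILITIES:

$$v_*(s) = \max_\pi v_\pi(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$

$$q_*(s, a) = \max_\pi q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$$

THE SECOND EQUATION DEFINES THE OPTIMAL ACTION-VALUE FUNCTION. THIS IS WHAT THE AGENT ULTIMATELY HAS TO FIND OUT.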
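ONCE THE OPTIMAL ACTION VALUES ARE KNOWN, ACTING OPTIMALLY IS EASY: IN EACH STATE, PICK THE ACTION WITH THE LARGEST VALUE. THE SKETCH BELOW APPLIES THE "MAX" VERSION OF THE BELLMAN UPDATE REPEATEDLY (A FORM OF VALUE ITERATION) ON A HYPOTHETICAL TWO-STATE PROBLEM. THE STATES, ACTIONS, REWARDS, AND `GAMMA` ARE MADE UP, AND THE DYNAMICS ARE ASSUMED TO BE KNOWN EXACTLY, WHICH IS RARELY TRUE IN PRACTICE.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics: DYNAMICS[(s, a)] = [(probability, next_state, reward), ...]
DYNAMICS = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}


def q_value_iteration(theta=1e-10):
    """Repeat the Bellman optimality update for q until it stops changing."""
    q = {sa: 0.0 for sa in DYNAMICS}
    while True:
        delta = 0.0
        for (s, a), outcomes in DYNAMICS.items():
            new_q = sum(
                p * (r + GAMMA * max(q[(s2, a2)] for a2 in ACTIONS))
                for p, s2, r in outcomes
            )
            delta = max(delta, abs(new_q - q[(s, a)]))
            q[(s, a)] = new_q
        if delta < theta:
            return q


def greedy_policy(q):
    """In each state, choose the action with the highest action value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


q_star = q_value_iteration()
best = greedy_policy(q_star)
```

IN THIS TOY PROBLEM THE GREEDY POLICY IS TO `go` FROM `s0` AND THEN `stay` IN `s1`, COLLECTING THE REWARD OF 2 AT EVERY STEP.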

THE BELLMAN EQUATION FOR THE OPTIMAL ACTION-VALUE FUNCTION IS THE ONE WE WERE LOOKING FOR! BUT...

THIS COMES AT A COST! THEORETICALLY WE HAVE THE ANSWER, BUT IN PRACTICE THIS EQUATION IS VERY DIFFICULT TO SOLVE. MEMORY IS A BIG CONSTRAINT, BECAUSE THE NUMBER OF POSSIBLE STATES SHOOTS UP EXPONENTIALLY AS THE NUMBER OF DIMENSIONS OF THE STATE INCREASES. REMEMBER, DYNAMIC PROGRAMMING WITH LOOK-UP TABLES IS THE WAY TO SOLVE IT EXACTLY, BUT EVEN BOARD GAMES LIKE CHESS BECOME HUGE IN TERMS OF COMPUTATION AND MEMORY.
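A QUICK BACK-OF-THE-ENVELOPE CALCULATION SHOWS HOW FAST A LOOK-UP TABLE GROWS. THE SKETCH BELOW ASSUMES THE STATE IS A TUPLE OF DISCRETE VARIABLES THAT CAN EACH TAKE THE SAME NUMBER OF VALUES. THE HELPER `table_entries` AND THE BOARD ENCODINGS ARE HYPOTHETICAL, AND THE COUNTS ARE CRUDE UPPER BOUNDS THAT INCLUDE MANY UNREACHABLE OR ILLEGAL POSITIONS.

```python
def table_entries(values_per_variable, num_variables, num_actions=1):
    """Entries in a tabular q(s, a) when a state is a tuple of
    `num_variables` discrete variables with `values_per_variable` options each."""
    return (values_per_variable ** num_variables) * num_actions


# 3x3 board, each cell empty / X / O: 3^9 possible cell assignments.
small_board = table_entries(3, 9)

# 8x8 board, each square empty or one of 12 piece types: 13^64 assignments.
# This heavily overcounts legal chess positions, but shows the scale.
large_board = table_entries(13, 64)
```

THE SMALL BOARD NEEDS AT MOST 19683 ENTRIES, WHICH FITS IN MEMORY EASILY, WHILE THE UPPER BOUND FOR THE LARGE BOARD HAS MORE THAN 70 DIGITS.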

HENCE WE HAVE TO SETTLE FOR APPROXIMATIONS. MOREOVER, MARKOV DECISION MAKING BECOMES CUMBERSOME ONCE DIMENSIONALITY INCREASES. IN THE NEXT ARTICLE WE WILL EXPLORE AND TRY TO SOLVE A PROBLEM USING THE MARKOV PROPERTY. THEN WE WILL HEAD TO MONTE CARLO METHODS FOR SOLVING SUCH PROBLEMS.

HAPPY LEARNING! MAKE SOME MEMORIES, WE MIGHT NEED THEM!