## DESCRIBING POLICIES AND VALUE FUNCTIONS OF A MARKOV DECISION PROCESS

THE LAST ARTICLE (FINITE MARKOV DECISION PROCESS -PART 1 ) WAS FOCUSED MAJORLY ON DESCRIBING WHAT WE MEAN BY RETURNS, AN ENVIRONMENT , AND AN ENVIRONMENT STATE FOLLOWING THE MARKOV PROPERTY . SO WHATS NEXT . NOW WE GET INTO THE MATHEMATICS OF POLICIES AND VALUE FUNCTIONS . WE SEE HOW TO DEFINE THESE MATHEMATICALLY AND THEN SEE THE ASSOCIATED BELLMAN EQUATIONS . LATER WE SEE THE OPTIMISED POLICIES AND BELLMAN FUNCTIONS AND HOW THEY ULTIMATELY LEAD US TO THE FINAL FUNCTION WE ARE LOOKING FOR . THEN WE SEE WHAT PROBLEMS SUCH SOLUTIONS PRESENT AND HOW WE CAN APPROXIMATE THE FUNCTION TO MEET OUR DEMANDS . IF YOU ARE NOT FAMILIAR WITH THE TERMINOLOGIES I WILL RECOMMEND REFERRING THE PREVIOUS TWO ARTICLES .

## POLICIES AND VALUE FUNCTIONS

THE VERY PURPOSE OF ANY REINFORCEMENT PROBLEM IS TO CALCULATE VALUE FUNCTIONS RELATED TO A STATE OR A STATE ACTION PAIR . WHICH IS A MEASURE OF HOW “GOOD”/”BAD ” IT IS FOR AN AGENT IF IT WERE TO START FROM THE PARTICULAR STATE . MATHEMATICALLY THIS GOOD/BAD HAS TO DO WITH WHAT FUTURE REWARDS ONE CAN EXPECT STARTING FROM A GIVEN STATE . THE POLICY FUNCTIONS ARE DEFINED USING WHAT ARE CALLED POLICIES . WE DEFINED POLICIES IN THE FIRST ARTICLE OF REINFORCEMENT LEARNING AS A MAPPING OF STATES TO PROBABILITIES OF SELECTING EACH POSSIBLE ACTION . BELOW IS A FORMAL DEFINITION OF A POLICY , TAKEN FROM AN INTRODUCTION TO REINFORCEMENT LEARNING BY BARTON AND SUTTON :

## THE MATHEMATICS OF VALUE FUNCTIONS

SUPPOSE THE AGENT IS IN A STATE St . THE VALUE FUNCTION FOR THE STATE IS THE EXPECTED RETURN VALUE , GIVEN THE AGENT FOLLOWS A CERTAIN POLICY . MATHEMATICALLY THE VALUE FUNCTION FOR A MARKOV DECISION PROCESS IS :

## E REFERS TO THE EXPECTATION VALUE , Gt IS THE RETURN FUNCTION

THE V pi(s) IS KNOWN AS THE STATE VALUE FUNCTION FOR A POLICY pi . WE DEFINE YET ANOTHER TYPE OF VALUE FUNCTION THAT INCORPORATES THE ACTION TAKEN FROM A CERTAIN STATE FOLLOWING A PARTICULAR POLICY . THIS TYPE OF VALUE FUNCTION KNOWN AS ACTION VALUE FUNCTION , IS DEFINED AS :

## PROPERTY OF VALUE FUNCTIONS

WE INTRODUCE AN IMPORTANT FEATURE OF A VALUE FUNCTION WHICH WILL BE USED IN THE ENTIRE REINFORCEMENT LEARNING ARTICLES AND IN DYNAMIC PROGRAMMING . AND THIS FEATURE IS THE RECURSIVE NATURE OF VALUE FUNCTIONS . SO FOR ANY STATE St AND THE ONES THAT COME IN FUTURE BY FOLLOWING A POLICY (pi) THE FOLLOWING CONDITION STANDS TRUE AS A RESULT OF THE RECURSIVE NATURE :

THE ABOVE EQUATION IS THE BELLMAN EQUATION WE MENTIONED IN THE BEGINNING OF THIS ARTICLE . NOTICE HOW WE STARTED WITH THE STATE VALUE FUNCTION , REPRESENTED THE RETURN FUNCTION AS A FUNCTION OF RETURN FUNCTION AT A FUTURE STEP AND THEN JUST USED THE FACT WE STATED IN THE PREVIOUS ARTICLE . THE BELLMAN EQUATION IS JUST THE PRODUCT OF ALL POLICIES FROM S AND THE SUM OF ALL PROBABILITIES OF TRANSITIONING USING ACTIONS AND GAINING REWARDS . WE KNOW USE THIS RECURSIVE BELLMAN EQUATION TO FIGURE OUT THE OPTIMAL POLICIES AND VALUE FUNCTIONS .

## OPTIMAL POLICIES AND VALUE FUNCTIONS

NOW OF ALL THE POLICIES AND VALUE FUNCTIONS WE WANT THE BEST ONES!! . THE BEST OPTIMAL POLICIES AND FUNCTIONS ARE JUST THE ONES THAT HAVE MAXIMUM VALUES AND HENCE GIVE MAXIMUM RETURN IN THE LONG RUN . THIS IS EXACTLY WHAT AN AGENT DESIRES . NOW WE LOOK AT THE TWO OPTIMAL EQUATIONS , PRACTICALLY THEY ARE NOTHING DIFFERENT FROM THE PREVIOUS ONES , WE JUST HAVE ADDED A “MAX” FUNCTION TO THEM :

### THE OPTIMAL ACTION VALUE FUNCTION . THIS IS WHAT THE AGENT ULTIMATELY HAS TO FIND OUT .

THE ABOVE BELLMAN EQUATION IS THE ONE WE WERE LOOKING FOR !!! BUT…………

THIS COMES AT A COST! THEORETICALLY WE HAVE THE ANSWER . BUT IN REALITY THIS EQUATION IS VERY DIFFICULT TO SOLVE . MEMORY IS A BIG CONSTRAINT IN COMPUTATION AS THE DIMENSIONS INCREASE AND THE NUMBER OF ALL POSSIBLE STATES SHOOT UP EXPONENTIALLY . REMEMBER USING DYNAMIC PROGRAMMING / LOOK UP TABLES IS THE WAY BUT EVEN SMALL GAMES LIKE CHESS BECOME HUGE IN TERMS OF COMPUTATION AND MEMORY AS THE DIMENSIONS INCREASE .

HENCE WE HAVE TO SETTLE FOR APPROXIMATIONS . MOREOVER MARKOV DECISION MAKING BECOMES CUMBERSOME ONCE DIMENSIONALITY INCREASES . IN THE NEXT ARTICLE WE WILL EXPLORE AND TRY TO SOLVE A PROBLEM USING MARKOV PROPERTY . THEN WE WILL HEAD TO MONTE CARLO METHODS OF SOLVING SUCH PROBLEMS .

HAPPY LEARNING!!! MAKE SOME MEMORIES , WE MIGHT NEED THEM!!

## DESCRIBING THE MARKOV DECISION PROCESS IN MACHINE LEARNING

IN THIS POST WE DISCUSS A CERTAIN TYPE OF REINFORCEMENT PROBLEM . WE DEFINE A MERKONIKOV PROCESS AND THEN WE GET TO THE MATHS . BUT AS ALWAYS FIRST WE NEED TO UNDERSTAND A FEW DEFINITIONS . THEN WE TRY TO RELATE WITH A REAL LIFE PROBLEM . WE HAVE ALREADY SEEN HOW AN AGENT INTERACTS WITH ITS ENVIRONMENT AND TRIES TO MAXIMISE HIS LONG TERM GAINS . FOR ANY PROBLEM WE REPRESENT “

1. THE SET OF ALL POSSIBLE STATES BY S AND AND THE STATE AT TIME t BY St
2. THE SET OF ALL POSSIBLE ACTIONS BY A AND THE STATE AT TIME t BY At
3. THE POLICY MAPPING IN A STATE S TO AN ACTION A BY pi(a|s)

## RETURNS

WE HAVE SEEN THAT THE GOAL OF AN AGENT IS TO MAXIMISE THE RETURNS . RETURNS REFER TO THE LONG TERM REWARDS STARTING FROM A PARTICULAR STATE . SO HOW TO REPRESENT IT MATHEMATICALLY. AS STATED IT MUST BE SOME FUNCTION OF PRESENT AND FUTURE REWARDS

SO WHAT IS A MARKOV PROCESS ? WELL WE NEED TO CONSIDER TWO THINGS HERE . FIRST THAT THE AGENT KNOWS WELL WHAT THE ENVIRONMENT EXACTLY IS . SECOND IF AT ANY TIME t , THE AGENT IS IN A STATE S , THE EXPECTED RETURNS FROM THERE ONWARDS DEPENDS ONLY ON THAT STATE AND THE ACTIONS TAKEN FROM THERE . THAT IS HOW YOU ENDED UP IN THAT STATE WON’T EFFECT THE FUTURE RETURNS . THIS IS HOW IT LOOKS LIKE :

## THE DISCOUNT PARAMETER

NOTICE THE PARAMETER GAMMA AND ITS SUBSEQUENT POWERS IN THE EXPRESSION? GAMMA IS REFERRED AS THE “DISCOUNT ” PARAMETER . IT IS A MEASURE THAT ALLOWS US TO SHOW THE DIMINISHING IMPORTANCE OF FUTURE REWARDS . A SMALL GAMMA WOULD TEND TOWARDS A GREEDY APPROACH WHERE THE RETURN FUNCTION FOCUSES ON THE MORE IMMEDIATE REWARDS . IF GAMMA =0 , THE AGENT IS SAID TO BE “MYOPIC ” THAT IS , FOCUSED ON JUST THE IMMEDIATE REWARD . SECONDLY GAMMA HELPS IN CONVERGENCE OF THE RETURN FUNCTION IF THE NUMBER OF TOTAL ALLOWED STEPS /ACTIONS IS NOT PREDEFINED . IN SUCH CASES , IN ABSENSE OF GAMMA , THE RETURN FUNCTION WOULD TURN INTO AN INFINITE SUM .

## THE MARKOV PROPERTY

A MARKONIKOV PROPERTY PROPERTY COMES UNDER A MORE GENERAL DEFINITION OF AN ENVIRONMENT . LETS DISCUSS WHAT DO WE MEAN BY A GENERAL ENVIRONMENT . SUPPOSE THE AGENT IS CURRENTLY IN A STATE St .

NOW IN A MORE GENERAL SENSE WHAT HAPPENS NEXT WOULD DEPEND IN HOW YOU ACTUALLY MADE UP TO THAT PARTICULAR STATE . THAT IS WHAT HAPPENS IS NOT ONLY A FUNCTION OF THE PRESENT STATE , RATHER THAN DEPENDING ON THE WHOLE CUMULATIVE PAST .

A GENERALISED REAL LIFE ENVIRONMENT

SUPPOSE YOU QUALIFIED THE FIRST LEVEL OF A MCQ BASED EXAM . LET SAY THE PREPARATION FOR THE SECOND ROUND IS THE PRESENT STATE St . NOW WHETHER YOU WOULD EXCEL IN THE SECOND ROUND WOULD BE DEPENDENT ON HOW YOU PASSED THE FIRST ROUND . IF HALF OF THE QUESTIONS WERE A GUESS WORK AND YOU WERE LUCKY ENOUGH TO GET PAST THE CUT-OFF SCORE YOUR CHANCES OF QUALIFYING THE NEXT ROUND IS LOW . ON THE OTHER HAND IF YOU SOLVED EVERY QUESTION THEN YOUR CHANCES ARE HIGH .

LETS REPRESENT THE ABOVE STATEMENTS MATHEMATICALLY , SUPPOSE THE HISTORY OF REACHING A STATE St IS REPRESENTED BY ((S0 ,A0, R1) ,(S1 ,A1, R2) ………..(S(t-1) , A(t-1), R(t)) . FOR A GENERAL THE FUTURE RETURNS WOULD DEPEND ON THIS HISTORY . (SO IN THE ABOVE EXAM CASE HOW MANY ACTIONS WERE GUESS WORK AND HOW MANY WERE SOLVING AND ANSWERING )

NOW LETS DISCUSS A SPECIAL CASE OF THE ABOVE GENERAL ENVIRONMENT . SUPPOSE YOU ARE PLAYING A GAME OF CHESS , FOR A PARTICULAR STATE THE FUTURE DEPENDS SOLELY ON HOW YOU HAVE REACHED THERE , AND FROM THERE ONLY THE NEXT MOVES MATTER . ANOTHER EXAMPLE IS SUPPOSE YOU HAVE TO REACH MUMBAI FROM LUCKNOW VIA DELHI . NOW ONCE YOU HAVE REACHED DELHI AIRPORT . OPTIMISING THE TIME TAKEN TO REACH MUMBAI WONT DEPEND ON HOW YOU REACHED DELHI . USING THE ABOVE INTUITION WE DEFINE BELOW WHAT IS KNOWN AS A MARKOV PROCESS .

## THE MATH

WE DEFINE THE PROBABILITY AT STEP t OF TRANSITIONING TO A STATE (s’) FROM (s) BY TAKING AN ACTION (a ) AND GETTING A REWARD (r )

#### NOTICE HOW THE PROBABILITY DEPENDS ONLY ON THE PRESENT STATE AND THE NEXT ACTION

SO NOW WE KNOW WHAT A STATE WHICH FOLLOW MARKOV PROPERTY MEANS . NEXT WE DISCUSS POLICY FUNCTIONS AND VALUE FUNCTIONS WITH RESPECT TO MARKOV DECISION MAKING PROCESSES . WE WILL SEE WHAT DRIVES A REINFORCEMENT LEARNING AGENT AND WE WILL BE INTRODUCED TO ONE OF THE MOST VITAL EQUATIONS IN REINFORCEMENT LEARNING , THE BELLMAN EQUATION.

HAPPY LEARNING!!!

## THE WHAT AND HOW OF REINFORCEMENT LEARNING

APART FROM SUPERVISED AND UNSUPERVISED MACHINE LEARNING , THERE IS A THIRD KIND OF MACHINE LEARNING APPROACH , WIDELY KNOWN AS REINFORCEMENT LEARNING . SO HOW IS THIS APPROACH DIFFERENT FROM THE FORMER TWO . HERE IN THIS ARTICLE WE DISCUSS THE OVER VIEW OF REINFORCEMENT LEARNING ,THE IMPORTANT TOPICS AND THE BASIC APPROACH . THE BASIC DIFFERENCE BETWEEN THE SUPERVISED/UNSUPERVISED AND THE REINFORCEMENT LEARNING APPROACH IS THE ABSENSE OF A PREDEFINED /AVAILABLE DATA SET ON WHICH WE JUST HAVE TO FIT OUR MODEL .

THERE ARE NO CLEAR INSTRUCTIONS /ALGORITHMS THAT STATE EXACTLY WHAT EACH ITERATION HAS TO DO. TO MAKE THIS POINT CLEAR LETS TRY TO VISUALIZE A SINGLE REAL LIFE PROBLEM WITH A SUPERVISED APPROACH AND A REINFORCEMENT LEARNING APPROACH . SUPPOSE A CHILD WANTS TO LEARN HOW TO RIDE A BICYCLE . HOW WOULD A SUPERVISED LEARNING APPROACH LOOK LIKE :

1. TELLING HIM THE EXACT AMOUNT OF FORCE HE HAS TO PUT ON THE PEDAL FOR EVERY POSITION OF THE PEDAL.
2. WATCHING THOUSANDS OF VIDEOS OF PEOPLE CYCLING AND TRYING TO LEARN FROM IT .
3. A LOSS FUNCTION THAT SHOWS HOW MUCH THE CENTRE OF GRAVITY IS AWAY FROM THE IDEAL POSITION .

I’M SURE YOU DIDN’T LEARN CYCLING THIS WAY . (AND NO LUCKY CHILD EVER WILL!!!) . IN FACT THE ONLY THING THAT COMES INTO THE MIND AS A CHILD WHILE LEARNING TO CYCLE IS ” I DON’T HAVE TO FALL ” . AND HOW DO YOU LEARN ?

1. YOU BEGIN BY KNOWING NOTHING EXCEPT THE FACT THAT WHAT YOU HAVE TO DO . IT INCLUDES THE GOAL AND THE FACT THAT PEDDLING AND BALANCING ARE THE ONLY THINGS YOU CAN DO .
2. TRYING A FEW TIMES , MAYBE FALLING .
3. AFTER FALLING YOU AVOID WHAT YOU DID THAT RESULTED IN A FALL .
4. AND AFTER TRYING A NUMBER OF TIMES YOU SUCCEED .

### THE BASIC DIFFERENCE

ISN’T IT STRANGE THAT EVEN AFTER A CHILD MASTERS HOW TO RIDE HIS CYCLE HE HAS NO IDEA ABOUT THE RULES THAT GOVERN THE MECHANICS /DYNAMICS OF THE CYCLE . YET HE LEARNT HOW TO BALANCE THE FORCES, THE TORQUES AND HOW TO SHIFT HIS CENTRE OF GRAVITY WHILE MAKING A TURN. THIS IS THE PRINCIPLE BEHIND REINFORCEMENT LEARNING . RATHER THAN HAVING A PREDEFINED SET OF RULES IT IS RATHER ” EXPLORATORY ” .

THE “AGENT ” – THE ONE WHO IS TRYING TO REACH THE GOAL , TRIES ALL THE “ACTIONS” THAT HE CAN PERFORM IN THE “ENVIRONMENT ” AVAILABLE TO HIM , THE ENVIRONMENT MIGHT BE COMPLETELY “KNOWN” OR HE MIGHT FACE “UNKNOWN ” REGIONS . LATER EVEN AFTER HE LEARNS A WAY THAT TAKES HIM TO THE GOAL HE TRIES TO “EXPLORE ” NEW WAYS TO FIND IF BETTER WAYS EXIST . LOOK AT THESE TWO CHARACTERISTICS OF REINFORCEMENT LEARNING :

THIS “EXPLORATION ” IS ONE OF THE CHARACTERISTIC DIFFERENCES BETWEEN REINFORCEMENT LEARNING AND OTHER FORMS OF MACHINE LEARNING . THERE IS TRADE OFF BETWEEN EXPLOITATION(GREEDY APPROACH) OR EXPLORATION .

IN REINFORCEMENT LEARNING EXPLICITLY CONSIDERS THE WHOLE PROBLEM OF A GOAL DIRECTED AGENT , IN CONTRAST TO OTHER LEARNING APPROACHES WHICH TEND TO DIVIDE THE PROBLEM INTO SUB PROBLEMS AND TRY TO OPTIMISE THEM WITHOUT WORRYING ABOUT THE LARGER PICTURE .

## TOWARDS THE MATH

WE START BY ADDRESSING THE BASIC TERMINOLOGIES THAT WE WILL USE :

### POLICY

A POLICY IS HOW THE LEARNING AGENT BEHAVES AT A GIVEN TIME/STATE IN THE ENVIRONMENT. IN OTHER WORDS POLICY IS A MAPPING FROM A STATE TO THE ACTIONS TO BE TAKEN WHEN IN THOSE STATES . IT MAY BE A FUNCTION , A LOOK UP TABLE . IN SIMPLE WORDS ITS THE WAYS YOU CAN TAKE ACTION WHEN YOU FACE A SITUATION . FOR EXAMPLE IF A TIGER SUDDENLY APPEARS IN YOUR ROOM, RUNNING AND SHOUTING IS A POSSIBLE POLICY , WHILE GREETING THE TIGER IS NOT (JUST DON’T, NO).

### REWARD SIGNAL

IN REINFORCEMENT LEARNING A ” GOAL ” IS MATHEMATICALLY REPRESENTED AS A REWARD SIGNAL . ON EVERY ACTION TAKEN , THE ENVIRONMENT SENDS A NUMBER , CALLED A REWARD TO SHOW WHETHER THAT PARTICULAR ACTION WORSEN OR MADE THE SITUATION BETTER . THE AGENTS GOAL IS TO MAXIMISE HIS REWARDS OVER THE LONG RUN . A PROFITABLE ACTION(WINNING ) GIVES YOU A LARGE POSITIVE REWARD , GETTING INTO TROUBLE (UNDESIRED STATES) A LARGE NEGATIVE , WHILE TAKING STEPS AND YIELDING NOTHING MIGHT BE A SMALL NEGATIVE REWARD( AS TO OPTIMISE REDUNDANT ACTIONS).

### VALUE FUNCTION

REWARD SIGNAL SHOWS HOW GOOD / BAD AN ACTION IS IN AN IMMEDIATE SENSE , WHILE VALUE FUNCTION SHOWS A PICTURE IN THE LONG RUN . SO VALUE FUNCTION IS THE AMOUNT OF REWARD AN AGENT CAN EXPECT TO RECIEVE IN THE FUTURE STARTING FROM A PARTICULAR STATE . TO UNDERSTAND THIS SUPPOSE A GUY GETS SOME MONEY TO SPEND . IMMEDIATE “REWARDS” LIKE EXPENSIVE WATCHES , PARTIES MIGHT SEEM TO BE GOOD IN THE IMMEDIATE SENSE BUT SAVING NOW AND INVESTING WILL INCREASE HIS MONEY AND WILL BE MORE REWARDING IN THE LONG RUN .

### EXPLOITATION AND EXPLORATION

EXPLOITATION REFERS TO SELECTING THE BEST ALTERNATIVE FROM THE POLICIES WHEN PRESENT IN A PARTICULAR STATE . EXPLORATION REFERS TO RANDOMLY SELECTING ONE OF THE ACTIONS (EXPLORING) IN HOPE TO FIND BETTER REWARDS IN THE LONG RUN . THIS IS ONE OF THE STRIKING CHARACTERISITCS OF REINFORCEMENT LEARNING .

### RETURNS

THE CUMULATIVE REWARD FUNCTION THAT THE AGENT HAS TO MAXIMIZE.

NOW LETS HAVE A LOOK AS TO HOW THE AGENT AND ITS ENVIRONMENT INTERACTS :

### THE AGENT IN A STATE PERFORMS AN ACTION INTERACTING WITH THE ENVIRONMENT ,PRODUCING A NEW STATE AND A REWARD

ABOVE IS THE BASIC STRUCTURE OF HOW AN AGENT INTERACTS WITH THE ENVIRONMENT . FURTHER REINFORCEMENT LEARNING PROBLEMS CAN BE CATEGORISED INTO MANY CATEGORIES . WE WILL DISCUSS THE THREE MAJOR VARIETIES OF REINFORCEMENT LEARNING :

THE FINITE MARKOV DECISION PROCESSES WHICH WILL INCLUDE DISCUSSING VALUE FUNCTIONS , BELLMAN EQUATIONS , THE MARKOV PROPERTY AND RETURNS . FOLLOWED BY MARKOV DECISION PROCESS( MDP ) WE WILL DISCUSS THE MONTE CARLO METHODS , MONTE CARLO ESTIMATION OF ACTION VALUES , OFF POLICY AND ON POLICY APPROACHES AND MONTE CARLO CONTROL .

FOLLOWED BY MONTE CARLO WE WILL DISCUSS TEMPORAL DIFFERENCE LEARNING . WE WILL LOOK AT REAL LIFE PROBLEMS AND COMPARE THESE THREE TYPES OF LEARNING . THE MATHEMATICS MOSTLY USES PROBABILITIES AND MATRICES FOR LOOK UP /SEARCH TABLES .

### APPLICATIONS

CURRENTLY REINFORCEMENT LEARNING IS BEING EXTENSIVELY RESEARCHED FOR DEVELOPING MACHINES WHICH LEARN TO PLAY AND WIN AGAINST THE BEST PLAYERS OF THE WORLD . IN 2016, GOOGLE DEEP MIND TRAINED A REINFORCEMENT LEARNING MODEL “ALPHA ZERO ” WHICH LEARNED TO PLAY “GO” . GO IS ONE OF THE MOST COMPUTATIONALLY DIFFICULT GAMES IN THE WORLD BECAUSE THE NUMBER OF POSSIBLE STATES IS EVEN MORE THAN THE NUMBER OF ATOMS IN THE UNIVERSE . FOR A GO BOARD OF SIZE (N*N) . THE NUMBER OF POSSIBLE STATES IS ( 3 POW(N SQUARED)) . IMAGINE THE NUMBER OF POSSIBLE STATES FOR A (19 *19) BOARD. SUCH IS “THE CURSE OF DIMENSIONALITY .”