battle, chess, checkmate

REINFORCEMENT LEARNING

THE WHAT AND HOW OF REINFORCEMENT LEARNING

APART FROM SUPERVISED AND UNSUPERVISED MACHINE LEARNING , THERE IS A THIRD KIND OF MACHINE LEARNING APPROACH , WIDELY KNOWN AS REINFORCEMENT LEARNING . SO HOW IS THIS APPROACH DIFFERENT FROM THE FORMER TWO . HERE IN THIS ARTICLE WE DISCUSS THE OVER VIEW OF REINFORCEMENT LEARNING ,THE IMPORTANT TOPICS AND THE BASIC APPROACH . THE BASIC DIFFERENCE BETWEEN THE SUPERVISED/UNSUPERVISED AND THE REINFORCEMENT LEARNING APPROACH IS THE ABSENSE OF A PREDEFINED /AVAILABLE DATA SET ON WHICH WE JUST HAVE TO FIT OUR MODEL .

THERE ARE NO CLEAR INSTRUCTIONS /ALGORITHMS THAT STATE EXACTLY WHAT EACH ITERATION HAS TO DO. TO MAKE THIS POINT CLEAR LETS TRY TO VISUALIZE A SINGLE REAL LIFE PROBLEM WITH A SUPERVISED APPROACH AND A REINFORCEMENT LEARNING APPROACH . SUPPOSE A CHILD WANTS TO LEARN HOW TO RIDE A BICYCLE . HOW WOULD A SUPERVISED LEARNING APPROACH LOOK LIKE :

  1. TELLING HIM THE EXACT AMOUNT OF FORCE HE HAS TO PUT ON THE PEDAL FOR EVERY POSITION OF THE PEDAL.
  2. WATCHING THOUSANDS OF VIDEOS OF PEOPLE CYCLING AND TRYING TO LEARN FROM IT .
  3. A LOSS FUNCTION THAT SHOWS HOW MUCH THE CENTRE OF GRAVITY IS AWAY FROM THE IDEAL POSITION .

I’M SURE YOU DIDN’T LEARN CYCLING THIS WAY . (AND NO LUCKY CHILD EVER WILL!!!) . IN FACT THE ONLY THING THAT COMES INTO THE MIND AS A CHILD WHILE LEARNING TO CYCLE IS ” I DON’T HAVE TO FALL ” . AND HOW DO YOU LEARN ?

  1. YOU BEGIN BY KNOWING NOTHING EXCEPT THE FACT THAT WHAT YOU HAVE TO DO . IT INCLUDES THE GOAL AND THE FACT THAT PEDDLING AND BALANCING ARE THE ONLY THINGS YOU CAN DO .
  2. TRYING A FEW TIMES , MAYBE FALLING .
  3. AFTER FALLING YOU AVOID WHAT YOU DID THAT RESULTED IN A FALL .
  4. AND AFTER TRYING A NUMBER OF TIMES YOU SUCCEED .

THE BASIC DIFFERENCE

ISN’T IT STRANGE THAT EVEN AFTER A CHILD MASTERS HOW TO RIDE HIS CYCLE HE HAS NO IDEA ABOUT THE RULES THAT GOVERN THE MECHANICS /DYNAMICS OF THE CYCLE . YET HE LEARNT HOW TO BALANCE THE FORCES, THE TORQUES AND HOW TO SHIFT HIS CENTRE OF GRAVITY WHILE MAKING A TURN. THIS IS THE PRINCIPLE BEHIND REINFORCEMENT LEARNING . RATHER THAN HAVING A PREDEFINED SET OF RULES IT IS RATHER ” EXPLORATORY ” .

THE “AGENT ” – THE ONE WHO IS TRYING TO REACH THE GOAL , TRIES ALL THE “ACTIONS” THAT HE CAN PERFORM IN THE “ENVIRONMENT ” AVAILABLE TO HIM , THE ENVIRONMENT MIGHT BE COMPLETELY “KNOWN” OR HE MIGHT FACE “UNKNOWN ” REGIONS . LATER EVEN AFTER HE LEARNS A WAY THAT TAKES HIM TO THE GOAL HE TRIES TO “EXPLORE ” NEW WAYS TO FIND IF BETTER WAYS EXIST . LOOK AT THESE TWO CHARACTERISTICS OF REINFORCEMENT LEARNING :

THIS “EXPLORATION ” IS ONE OF THE CHARACTERISTIC DIFFERENCES BETWEEN REINFORCEMENT LEARNING AND OTHER FORMS OF MACHINE LEARNING . THERE IS TRADE OFF BETWEEN EXPLOITATION(GREEDY APPROACH) OR EXPLORATION .

IN REINFORCEMENT LEARNING EXPLICITLY CONSIDERS THE WHOLE PROBLEM OF A GOAL DIRECTED AGENT , IN CONTRAST TO OTHER LEARNING APPROACHES WHICH TEND TO DIVIDE THE PROBLEM INTO SUB PROBLEMS AND TRY TO OPTIMISE THEM WITHOUT WORRYING ABOUT THE LARGER PICTURE .

TOWARDS THE MATH

WE START BY ADDRESSING THE BASIC TERMINOLOGIES THAT WE WILL USE :

POLICY

A POLICY IS HOW THE LEARNING AGENT BEHAVES AT A GIVEN TIME/STATE IN THE ENVIRONMENT. IN OTHER WORDS POLICY IS A MAPPING FROM A STATE TO THE ACTIONS TO BE TAKEN WHEN IN THOSE STATES . IT MAY BE A FUNCTION , A LOOK UP TABLE . IN SIMPLE WORDS ITS THE WAYS YOU CAN TAKE ACTION WHEN YOU FACE A SITUATION . FOR EXAMPLE IF A TIGER SUDDENLY APPEARS IN YOUR ROOM, RUNNING AND SHOUTING IS A POSSIBLE POLICY , WHILE GREETING THE TIGER IS NOT (JUST DON’T, NO).

REWARD SIGNAL

IN REINFORCEMENT LEARNING A ” GOAL ” IS MATHEMATICALLY REPRESENTED AS A REWARD SIGNAL . ON EVERY ACTION TAKEN , THE ENVIRONMENT SENDS A NUMBER , CALLED A REWARD TO SHOW WHETHER THAT PARTICULAR ACTION WORSEN OR MADE THE SITUATION BETTER . THE AGENTS GOAL IS TO MAXIMISE HIS REWARDS OVER THE LONG RUN . A PROFITABLE ACTION(WINNING ) GIVES YOU A LARGE POSITIVE REWARD , GETTING INTO TROUBLE (UNDESIRED STATES) A LARGE NEGATIVE , WHILE TAKING STEPS AND YIELDING NOTHING MIGHT BE A SMALL NEGATIVE REWARD( AS TO OPTIMISE REDUNDANT ACTIONS).

VALUE FUNCTION

REWARD SIGNAL SHOWS HOW GOOD / BAD AN ACTION IS IN AN IMMEDIATE SENSE , WHILE VALUE FUNCTION SHOWS A PICTURE IN THE LONG RUN . SO VALUE FUNCTION IS THE AMOUNT OF REWARD AN AGENT CAN EXPECT TO RECIEVE IN THE FUTURE STARTING FROM A PARTICULAR STATE . TO UNDERSTAND THIS SUPPOSE A GUY GETS SOME MONEY TO SPEND . IMMEDIATE “REWARDS” LIKE EXPENSIVE WATCHES , PARTIES MIGHT SEEM TO BE GOOD IN THE IMMEDIATE SENSE BUT SAVING NOW AND INVESTING WILL INCREASE HIS MONEY AND WILL BE MORE REWARDING IN THE LONG RUN .

EXPLOITATION AND EXPLORATION

EXPLOITATION REFERS TO SELECTING THE BEST ALTERNATIVE FROM THE POLICIES WHEN PRESENT IN A PARTICULAR STATE . EXPLORATION REFERS TO RANDOMLY SELECTING ONE OF THE ACTIONS (EXPLORING) IN HOPE TO FIND BETTER REWARDS IN THE LONG RUN . THIS IS ONE OF THE STRIKING CHARACTERISITCS OF REINFORCEMENT LEARNING .

RETURNS

THE CUMULATIVE REWARD FUNCTION THAT THE AGENT HAS TO MAXIMIZE.

NOW LETS HAVE A LOOK AS TO HOW THE AGENT AND ITS ENVIRONMENT INTERACTS :

REINFORCEMENT LEARNING AGENT

THE AGENT IN A STATE PERFORMS AN ACTION INTERACTING WITH THE ENVIRONMENT ,PRODUCING A NEW STATE AND A REWARD

ABOVE IS THE BASIC STRUCTURE OF HOW AN AGENT INTERACTS WITH THE ENVIRONMENT . FURTHER REINFORCEMENT LEARNING PROBLEMS CAN BE CATEGORISED INTO MANY CATEGORIES . WE WILL DISCUSS THE THREE MAJOR VARIETIES OF REINFORCEMENT LEARNING :

THE FINITE MARKOV DECISION PROCESSES WHICH WILL INCLUDE DISCUSSING VALUE FUNCTIONS , BELLMAN EQUATIONS , THE MARKOV PROPERTY AND RETURNS . FOLLOWED BY MARKOV DECISION PROCESS( MDP ) WE WILL DISCUSS THE MONTE CARLO METHODS , MONTE CARLO ESTIMATION OF ACTION VALUES , OFF POLICY AND ON POLICY APPROACHES AND MONTE CARLO CONTROL .

FOLLOWED BY MONTE CARLO WE WILL DISCUSS TEMPORAL DIFFERENCE LEARNING . WE WILL LOOK AT REAL LIFE PROBLEMS AND COMPARE THESE THREE TYPES OF LEARNING . THE MATHEMATICS MOSTLY USES PROBABILITIES AND MATRICES FOR LOOK UP /SEARCH TABLES .

APPLICATIONS

CURRENTLY REINFORCEMENT LEARNING IS BEING EXTENSIVELY RESEARCHED FOR DEVELOPING MACHINES WHICH LEARN TO PLAY AND WIN AGAINST THE BEST PLAYERS OF THE WORLD . IN 2016, GOOGLE DEEP MIND TRAINED A REINFORCEMENT LEARNING MODEL “ALPHA ZERO ” WHICH LEARNED TO PLAY “GO” . GO IS ONE OF THE MOST COMPUTATIONALLY DIFFICULT GAMES IN THE WORLD BECAUSE THE NUMBER OF POSSIBLE STATES IS EVEN MORE THAN THE NUMBER OF ATOMS IN THE UNIVERSE . FOR A GO BOARD OF SIZE (N*N) . THE NUMBER OF POSSIBLE STATES IS ( 3 POW(N SQUARED)) . IMAGINE THE NUMBER OF POSSIBLE STATES FOR A (19 *19) BOARD. SUCH IS “THE CURSE OF DIMENSIONALITY .”

Leave a Comment

Your email address will not be published. Required fields are marked *