
REINFORCEMENT LEARNING

THE WHAT AND HOW OF REINFORCEMENT LEARNING

Apart from supervised and unsupervised machine learning, there is a third kind of machine learning approach, widely known as reinforcement learning. So how is this approach different from the former two? Here in this article we discuss an overview of reinforcement learning, its important topics and the basic approach. The basic difference between the supervised/unsupervised approaches and reinforcement learning is the absence of a predefined, available data set on which we just have to fit our model.

There are no clear instructions or algorithms that state exactly what each iteration has to do. To make this point clear, let's try to visualize a single real-life problem with a supervised approach and with a reinforcement learning approach. Suppose a child wants to learn how to ride a bicycle. What would a supervised learning approach look like?

  1. Telling him the exact amount of force he has to put on the pedal for every position of the pedal.
  2. Watching thousands of videos of people cycling and trying to learn from them.
  3. A loss function that shows how far the centre of gravity is from the ideal position.

I'm sure you didn't learn cycling this way (and no lucky child ever will!). In fact, the only thing that comes to mind as a child learning to cycle is "I must not fall". So how do you learn?

  1. You begin by knowing nothing except what you have to do: the goal, and the fact that pedalling and balancing are the only things you can do.
  2. You try a few times, maybe falling.
  3. After falling, you avoid whatever you did that resulted in a fall.
  4. And after trying a number of times, you succeed.

THE BASIC DIFFERENCE

Isn't it strange that even after a child masters riding his cycle, he has no idea about the rules that govern its mechanics and dynamics? Yet he has learnt how to balance the forces and torques, and how to shift his centre of gravity while making a turn. This is the principle behind reinforcement learning: rather than following a predefined set of rules, it is "exploratory".

The "agent", the one trying to reach the goal, tries the "actions" he can perform in the "environment" available to him; the environment might be completely "known", or he might face "unknown" regions. Even after he learns a way that takes him to the goal, he keeps trying to "explore" new ways, in case better ones exist. Look at these two characteristics of reinforcement learning:

This "exploration" is one of the characteristic differences between reinforcement learning and other forms of machine learning. There is a trade-off between exploitation (the greedy approach) and exploration.

Reinforcement learning explicitly considers the whole problem of a goal-directed agent, in contrast to other learning approaches, which tend to divide the problem into sub-problems and optimise them without worrying about the larger picture.

TOWARDS THE MATH

We start by addressing the basic terminology we will use:

POLICY

A policy is how the learning agent behaves at a given time/state in the environment. In other words, a policy is a mapping from states to the actions to be taken when in those states. It may be a function or a look-up table. In simple words, it's the way you take action when you face a situation. For example, if a tiger suddenly appears in your room, running and shouting is a possible policy, while greeting the tiger is not (just don't, no).

REWARD SIGNAL

In reinforcement learning, a "goal" is mathematically represented as a reward signal. On every action taken, the environment sends back a number, called a reward, to show whether that particular action worsened or improved the situation. The agent's goal is to maximise its reward over the long run. A profitable action (winning) gives you a large positive reward, getting into trouble (undesired states) a large negative one, while taking steps that yield nothing might give a small negative reward (to discourage redundant actions).

VALUE FUNCTION

The reward signal shows how good or bad an action is in an immediate sense, while the value function gives a picture of the long run. The value function is the amount of reward an agent can expect to accumulate in the future starting from a particular state. To understand this, suppose a guy gets some money to spend. Immediate "rewards" like expensive watches and parties might seem good in the moment, but saving and investing now will grow his money and be more rewarding in the long run.

EXPLOITATION AND EXPLORATION

Exploitation refers to selecting the best-known action when in a particular state. Exploration refers to randomly selecting one of the other actions in the hope of finding better rewards in the long run. This is one of the striking characteristics of reinforcement learning.
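A common way to balance this trade-off is an epsilon-greedy rule: exploit most of the time, explore with a small probability. A minimal sketch (the function name and the list-of-action-values interface are illustrative assumptions, not from this article):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (pick a random action),
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit
```

With epsilon = 0 the agent is purely greedy; with epsilon = 1 it explores blindly.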

RETURNS

The cumulative reward that the agent has to maximise.
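In the standard notation (the symbols below are the conventional ones, not taken from this article), the discounted return from time t is:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma \le 1
```

The discount factor gamma weighs immediate rewards against distant ones.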

Now let's have a look at how the agent and its environment interact:

[Figure: the reinforcement learning loop. The agent, in a state, performs an action on the environment, producing a new state and a reward.]
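This interaction loop can be sketched in a few lines of Python; `env_step` and `policy` are hypothetical callables standing in for a real environment and a real policy:

```python
def run_episode(env_step, initial_state, policy, max_steps=100):
    """Basic agent-environment loop: observe a state, take an action,
    receive a new state and a reward, accumulate the total reward."""
    state, total_reward = initial_state, 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward
```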

Above is the basic structure of how an agent interacts with the environment. Reinforcement learning problems can be categorised further; we will discuss three major varieties of reinforcement learning:

Finite Markov decision processes (MDPs), which will include value functions, Bellman equations, the Markov property and returns. After MDPs we will discuss Monte Carlo methods: Monte Carlo estimation of action values, off-policy and on-policy approaches, and Monte Carlo control.

After Monte Carlo we will discuss temporal-difference learning. We will look at real-life problems and compare these three types of learning. The mathematics mostly uses probabilities and matrices for look-up/search tables.

APPLICATIONS

Reinforcement learning is currently being extensively researched for developing machines that learn to play and win against the best players in the world. In 2016, Google DeepMind's reinforcement-learning-based program AlphaGo defeated one of the world's top players at the game of Go. Go is one of the most computationally difficult games in the world because the number of possible states is even greater than the number of atoms in the universe: for a Go board of size N×N, the number of possible board configurations is 3^(N²). Imagine the number of possible states for a 19×19 board. Such is "the curse of dimensionality".
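A quick sanity check of that count in Python (the function name is illustrative; this is the naive upper bound, since each intersection can be empty, black, or white, and the number of legal positions is lower but still astronomical):

```python
def go_states(n):
    """Naive upper bound on Go board configurations for an n-by-n board:
    each of the n*n intersections is empty, black, or white."""
    return 3 ** (n * n)

digits = len(str(go_states(19)))  # how many digits 3**361 has
```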


K MEANS CLUSTERING IN MACHINE LEARNING

THE K-MEANS CLUSTERING ALGORITHM USED IN MACHINE LEARNING

We have seen how classification problems are tackled using logistic regression. Here we discuss an algorithm that helps us group things into multiple classes. The interesting part is that there are no labels associated with the data points to tell us to which class a certain data instance belongs (k-means clustering is not to be confused with k-nearest neighbours, where we need labelled data), making it an unsupervised machine learning problem. Let's make this point clear by considering a real-life example where we as humans have classified numerous unlabelled data. Any living creature is classified as an animal or a plant; further, we associate thousands of features to make classifications like kingdom, class, order and family. But notice how no animal has a tag on it saying "I belong to such-and-such category". So when we encounter a new species, how do we decide which class it belongs to?

Moreover, the level of classification required depends on the problem statement. Someone like a researcher might be interested in the full depth of the classification, while for others the difference between being a reptile and a bird is enough. This leads to a major conclusion: depending on how fine-grained our classes are, points may fall together in a certain class at one level of classification, yet change classes as the granularity increases.

For example, a pigeon and a rabbit fall in the same class if the division is based only on whether an animal lives in water or not, but they fall in different classes once further details are considered.

WHAT DOES “K” SIGNIFY

The difficulty/complexity of the problem lies in deciding into how many classes one has to distribute the data instances.

In machine learning, this is the basic idea behind k-means clustering. The value of k is the number of "classes" we are considering, in other words the number of centroids our algorithm will use. Hence a larger k implies a stricter classification. Theoretically, one can have as many classes as there are data points in the data set; that would be being so strict that every object becomes a class as well as the only member of that class!

HOW TO MEASURE “CLOSENESS”: DISTANCE AND ITS TYPES

Obviously, things that are similar or closely related tend to fall within the same class. Mathematically, closeness refers to the distance between two points. Distances come in the following types:

  1. Euclidean
  2. Manhattan
  3. Minkowski
  4. Hamming
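As a minimal sketch, here are the four metrics for points given as equal-length sequences (the function names are illustrative):

```python
def euclidean(a, b):
    # straight-line distance: sqrt of the sum of squared differences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # "city block" distance: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # generalisation: p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # number of positions at which the entries differ
    return sum(x != y for x, y in zip(a, b))
```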
[Figure: plot of k-means clustering. The black points are the centroids; 3 centroids result in classification into 3 groups.]

In k-means clustering we use the well-known Euclidean distance metric. Let's see the algorithm:

  1. Plot the (unlabelled) data set.
  2. Choose the value of k, the number of classes you want.
  3. Randomly draw k points on the plot (these are the k centroids).
  4. For every point, calculate the k distances (the distance from each centroid).
  5. Associate each point with the centroid to which it has the minimum distance.
  6. Now you have divided the data set into k sets; each set contains the points nearest to a particular centroid.
  7. Now suppose a particular set S contains m points; calculate the mean coordinate of these m points.
  8. This mean coordinate is the new centroid. Do this for all k sets to get k updated centroids.
  9. Repeat from step 4 until an iteration in which none of the points change their set.
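The nine steps above can be sketched as a small pure-Python implementation (random initialisation, Euclidean assignment, centroid update; the names are illustrative assumptions):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until assignments settle."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # step 3: k random centroids
    assignment = None
    for _ in range(max_iters):
        # steps 4-5: attach every point to its closest centroid
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:        # step 9: nothing changed, stop
            break
        assignment = new_assignment
        # steps 7-8: move each centroid to the mean of its cluster
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Different random initialisations can converge to different clusterings; in practice the algorithm is restarted several times and the lowest-dispersion result is kept.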
[Figure: an algorithmic definition of the k-means approach.]

The following is a more mathematical definition for people who want a deeper understanding:

[Figure: the k-means objective function, minimising the within-cluster sum of squares Σᵢ₌₁ᵏ Σ_{x∈Sᵢ} ‖x − μᵢ‖² over the assignments Sᵢ and the centroids μᵢ.]


HOW DO WE DECIDE THE BEST K VALUE FOR OUR DATA SET?

Not all data sets are the same: some are easily linearly separable, so k = 2 would be enough, but in many cases this is not possible. It also varies with the complexity of the problem. We use the elbow method to decide the ideal value of k for a particular data set.

This is to ensure that the model doesn't overfit. Surely adding more classes will make the model fit better, but if we keep adding classes we will soon be overfitting, and eventually each object in the data set would be a class of its own!

In the elbow method we plot the within-cluster variance against the number of classes. The graph turns out to look like "an elbow": decreasing sharply at first and then forming an "L"-shaped curve. The sharpness of the bend depends on the particular data set, and this sharp "bend" corresponds to the ideal k. Why? Because every further introduction of a new class changes the classification only minimally. Take a look at the graph below; things will become clear:

[Figure: the k-means elbow method, a plot of average dispersion (variance) vs the number of classes (k).]
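A minimal sketch of the quantity being plotted (the helper name is an assumption of this sketch): for two well-separated pairs of points, going from k = 1 to k = 2 collapses the dispersion, while k = 3 barely helps, which is exactly the "elbow".

```python
def within_cluster_ss(points, centroids, assignment):
    """Total within-cluster sum of squared distances: the dispersion
    whose plot against k produces the 'elbow'."""
    return sum(
        sum((p - q) ** 2 for p, q in zip(pt, centroids[a]))
        for pt, a in zip(points, assignment)
    )

# Two well-separated pairs of points
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
ss_k1 = within_cluster_ss(pts, [(5, 5.5)], [0, 0, 0, 0])              # one cluster
ss_k2 = within_cluster_ss(pts, [(0, 0.5), (10, 10.5)], [0, 0, 1, 1])  # two clusters
```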

Happy classifying!

LOGISTIC REGRESSION

LOGISTIC REGRESSION AND ITS ASSUMPTIONS

WHAT IS LOGISTIC REGRESSION: ITS ASSUMPTIONS AND USES IN MACHINE LEARNING

The word "regression" in logistic regression is a misnomer. A linear regression model is meant to predict a value based on its training; a logistic regression model is used in classification problems. To make this distinction clear we need to address one question: what is classification, and what is a good way to classify things? There are cases where classifying things is rather trivial (at least for humans). Let's discuss the intuition behind logistic regression and its assumptions before getting to the math. For example, you can easily tell water from fire, a cat from an elephant, a car from a pen. It's just yes or no: a problem consisting of two classes, answerable as a yes or a no.

Now suppose I ask whether or not you like a certain food item. How will your answer vary from the previous cases?

Surely there are items you would love to eat and some you would straightaway refuse, but for some food items you wouldn't be so judgemental. Suppose your answer goes like this: "It's not that I would die if I didn't eat it, but if I got the chance I would definitely take a few bites." You see, this is rather confusing even for a human, let alone a machine. So we take the following approach.

PROBABILITY COMES TO THE RESCUE

For such problems, be it liking a movie, a food item or a song, it's always better to deal with a continuous range rather than a binary answer. So the question "On a scale of 0 to 1, how much do you like pasta?" (duh! is that a question) now allows you to express your liking in a much more elaborate way.

Another advantage of probability is that such a distribution lets you escape the "harshness" of a boolean representation. Let's make this point clear. Suppose someone scores two movies on a scale of 0 to 1, say 0.49 and 0.51 respectively. What would the same scores look like as a binary output? One film qualifies as good while the other as bad (taking 0.5 as the midpoint).

So even though the person found the films almost identical (a difference of 0.02), a binary classifier doesn't show any mercy: it's either a yes or a no. This is why probability distributions are better.

Now why can't we use linear regression to solve a classification problem? We could have predicted a "probability" value there too, right? Just use the ratings as the dependent variable, use one-hot encoding for features like the presence or absence of an actor, and isn't that enough? The answer is that such encoding can destroy important patterns in the data set. While encoding (bad, good, best) as (-1, 0, 1) might be a good option, since the qualities are in increasing order, can we encode (rabbit, elephant, eagle) as (-1, 0, 1)? Is the difference between an eagle and a rabbit the same as between an eagle and an elephant? Well, no! Also, even for simple regression problems a line can be a bad choice, as there could be many error points.

LOGISTIC REGRESSION

For logistic regression we use a sigmoid function, which looks something like this:

[Figure: the sigmoid function used in logistic regression. Gradient refers to the slope; notice how for all real x the output lies in (0, 1).]

Now let's get to the math. The word "logistic" refers to the logit, the logarithm of the odds (chances):

odds of an event = P(event occurs) / (1 − P(event occurs))

So in logistic regression we try to find the probability of belonging to a certain class given an input instance x. We write the conditional probability P(Y = 1 | X = x) = p(x), where "1" is not a number but a class. So the odds can be written as p(x) / (1 − p(x)). Okay, but what do we learn? In linear regression we were looking for the best-fit line, and the parameters we were optimising were (m, c), the slope and the intercept. What is the parameter here?
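A minimal sketch of the sigmoid and its inverse, the log-odds (the function names are illustrative):

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """The logit, log(p / (1 - p)): the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))
```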

WHAT ARE THE PARAMETERS

We introduce a parameter beta into the sigmoid function. This beta decides two things:

  1. At what value of x the output is 0.5.
  2. How steep the sigmoid is: as beta tends to infinity, the sigmoid turns into a step function (yes/no). This beta is what we need to optimise according to our training data set.
[Figure: the sigmoid with the learnable parameter beta, and its linear relation with the "log odds".]

Again we need to decide our loss function! We use beta-hat to represent an estimated beta. Logistic regression uses the concept of maximum likelihood to optimise beta-hat: we maximise the product of the probabilities p(x) over all x in class 1, multiplied by the product of (1 − p(x)) over all x in class 0. In simple terms, this approach chooses beta-hat so that p(x) is large when y = 1 and small when y = 0.

[Figure: the maximum likelihood function we want to maximize; the third equation takes the log of the second one.]
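Written out in standard notation (the symbols are the conventional ones, not reproduced from the figure), the likelihood and its log are:

```latex
L(\beta) = \prod_{i:\,y_i=1} p(x_i) \prod_{i:\,y_i=0} \bigl(1 - p(x_i)\bigr),
\qquad
\ell(\beta) = \sum_{i} \Bigl[\, y_i \log p(x_i) + (1 - y_i)\log\bigl(1 - p(x_i)\bigr) \Bigr]
```

Taking the log turns the product into a sum, which is far easier to differentiate.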

Simplifying the above, we reach the following equation:

[Figure: the resulting transcendental equation.]

Such an equation, containing logs and exponents, cannot be solved in closed form; equations like this are known as transcendental equations. But we can find approximate ways of solving them!

THE NEWTON-RAPHSON METHOD (THE APPROXIMATION)

Here we use the Taylor series expansion of the maximum likelihood function we derived, ignoring the insignificant higher powers as part of our assumptions. Then we keep iterating and updating beta until its value converges and further updates no longer affect it. The update of beta uses two factors: the gradient and the Hessian matrix. If you are not comfortable with vector calculus you can skip this section; in simple words, we find beta using this approach and then we have our sigmoid function. This is how the gradient and the Hessian look:

[Figure: the gradient and the Hessian and their matrix representations; W is the diagonal matrix with entries p(x)(1 − p(x)).]

[Figure: using the gradient and the Hessian we iterate t times until beta converges, giving the trained sigmoid function.]
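A sketch of the resulting update, beta ← beta + (XᵀWX)⁻¹ Xᵀ(y − p), assuming NumPy and a design matrix `X` whose first column is all ones (the function name, iteration count, and data conventions are assumptions of this sketch, not prescriptions from the article):

```python
import numpy as np

def newton_raphson_logistic(X, y, iters=25):
    """Newton-Raphson updates for logistic regression:
    gradient = X^T (y - p), Hessian = -X^T W X with W = diag(p(1 - p))."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current predicted probabilities
        W = np.diag(p * (1.0 - p))            # the diagonal weight matrix
        grad = X.T @ (y - p)
        hessian = X.T @ W @ X
        beta = beta + np.linalg.solve(hessian, grad)
    return beta
```

Note that for perfectly separable data the optimal beta runs off to infinity, so a sketch like this assumes the classes overlap.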

Happy classifying!


LINEAR REGRESSION

WHAT IS LINEAR REGRESSION AND ITS USES IN MACHINE LEARNING

One of the most common problems we come across in daily life is predicting values: the price of a commodity, the age of a person, the number of years needed to master a skill, and so on. As a human, what is your approach when you make such predictions? What parameters do you consider? A human brain has no difficulty realising whether a certain problem is linear or not. Suppose I tell you that a certain clothing company sells 2 clothes for 2k bucks and 4 clothes for 4k bucks, but 8 clothes for 32k bucks.

Immediately your brain tells you that the last 8 clothes must have been of a different quality, or a different brand, setting them apart from the other two groups. But if I state that the last 8 clothes were for 8k bucks, your brain signals a linear relation that it can make out of these numbers.

Many daily-life problems are tagged as "linear" out of common sense. A few examples:

  1. The prices of flats on the same floor of an apartment building are roughly linearly proportional to the number of rooms.
  2. The rent of a car is linearly proportional to the distance you travel.

"By linearly varying we don't mean that, when plotted, all the data points strictly pass through a single line, but that they show a trend where the dependent variable can be viewed as some linear function of the independent variable plus some random noise."

THE MATH

You must be aware of the Euclidean distance between a straight line and points that do not lie on it. Our aim is to find a model that uses the provided data to make predictions of the dependent variable whenever a certain value of the independent variable is given.

Putting it mathematically:

For a given data set S → {A : B}, where A is the independent variable and B the corresponding dependent one, find the best pair (m, c) such that the average of the sum of squares of the differences between each B and the corresponding y on the line y = mA + c is minimised, where the average is taken over the number of points.
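For a single feature this minimisation has a well-known closed form; a small pure-Python sketch (the function name is illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature:
    slope m = cov(x, y) / var(x), intercept c = mean(y) - m * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx
```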

THE LOSS FUNCTION

Now we know what we need to minimise; this particular quantity is termed the "loss function". It is a measure of how well your model fits the training data. Let's see what some of the commonly used error functions look like:

[Figure: common loss functions, where e_t is the difference between the y coordinate of a data point and the predicted y value, and n is the total number of data points.]

We consider squared distances so that positive and negative coordinate differences do not cancel out. Another regularly used loss function is RMSLE (root mean squared logarithmic error):

[Figure: the RMSLE formula, where y_i and y-hat are the actual and predicted values.]
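As a sketch, RMSE and RMSLE side by side (the function names are illustrative; `log1p` computes log(1 + x), which also handles zero values gracefully):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: the log damps large outliers."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )
```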

RMSE VS RMSLE

The "L" in RMSLE stands for "logarithmic"; this loss is preferred when certain data points vary exponentially, since taking the log first substantially reduces the effect of a possible outlier. Below is a representation summing up the scenario: the data points are in blue, with the best-fit line passing through them. Note how you can see a "linear relation" between the data points. Such a capability of "seeing" a data set's behaviour is limited to humans, and using this intuition we chose to fit a "best-fit line" rather than a parabola or any other curve. Such a bias towards a certain model class due to our capabilities is called "inductive bias".

[Figure: linear regression, with the data points in blue and the best-fit line through them.]

SOME MORE MATH

Suppose for a set → {X : Y} we want to calculate the estimated function y-hat as some function of X. (Remember, a hat on top of any variable means it is an estimate, not the real value; we always build models that try to estimate a function which is, in theory, unknown.) In the case of linear regression this can be represented as y-hat = β₀ + β₁x.

Now, substituting this into the RMSE loss function, we get:

[Figure: the loss function and its differentiation for linear regression.]

Given the above equation, this is what we want to minimise. Differentiating it with respect to the betas and equating to zero, we get the following results.

The following are the values of the variables we needed to find:
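In standard notation (consistent with the y-hat = β₀ + β₁x setup above), the least-squares estimates are:

```latex
\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2},
\qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x
```

The slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means.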

There are various ways to optimise the above model to avoid overfitting. Regularisation techniques like ridge regression, lasso, and elastic nets (a combination of both) penalise models that tend to overfit. This is done using loss functions different from the ones we have used here; the difference arises from introducing additional penalty terms into the loss function already discussed.
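As a sketch in standard notation (λ, the regularisation strength, is a conventional symbol, not one used earlier in this article), the penalised losses are:

```latex
L_{\text{ridge}}(\beta) = \sum_i \bigl(y_i - \hat y_i\bigr)^2 + \lambda \sum_j \beta_j^2,
\qquad
L_{\text{lasso}}(\beta) = \sum_i \bigl(y_i - \hat y_i\bigr)^2 + \lambda \sum_j \lvert \beta_j \rvert
```

Ridge shrinks all coefficients smoothly, while the lasso's absolute-value penalty can drive some coefficients exactly to zero.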

Do leave comments/queries!