What Is Logistic Regression? Its Assumptions and Uses in Machine Learning Algorithms
The word "regression" in logistic regression is a misnomer. A linear regression model is meant to predict a continuous value based on its training; a logistic regression model is used in classification problems. To make this distinction clear we need to address one question: what is classification, and what is a good way to classify things? Let's build the intuition behind logistic regression and its assumptions before getting to the math. There are certain cases where classifying things is rather trivial (at least for humans). For example, you can easily tell water from fire, a cat from an elephant, a car from a pen. It's just yes or no: a problem consisting of two classes that can be answered with a yes or a no.
Now suppose I ask you whether or not you like a particular food item. How will your answer vary this time from the previous cases?
Surely there would be items you would love to eat and some you would straightaway refuse, but for some food items you wouldn't be so judgemental. Suppose your answer goes like this: "It's not that I would die if I didn't eat it, but if I got the chance I would definitely take a few bites." You see, this is rather confusing even for a human, let alone a machine. So we take the following approach.
Probability Comes to the Rescue
For such problems, be it liking a movie, a food item, or a song, it's always better to deal with a continuous range rather than a binary answer. So the question "On a scale of 0 to 1, how much do you like pasta?" (duh! is that even a question?) now allows you to express your liking in a much more elaborate way.
Another advantage of probability is that a continuous output lets you escape the "harshness" a Boolean representation imposes. Let's make this point clear. Suppose someone scores two movies on a scale of 0 to 1, and the scores are 0.49 and 0.51 respectively. What would the same scores look like as a binary output? One film qualifies as good while the other as bad (taking 0.5 as the cutoff).
So even though the person found the films almost identical (a difference of 0.02), a binary classifier doesn't show any mercy! It's either a yes or a no. This is why probability outputs are better.
Now, why can't we use linear regression to solve a classification problem? We could have predicted a "probability" value there too, right? Just use the ratings as the dependent variable, use one-hot or numeric encoding for features like the presence or absence of an actor, and isn't that enough? The answer is that naively forcing categories onto a numeric scale can erase or invent patterns in the data set. Encoding (bad, good, best) as (-1, 0, 1) might be a fair option, since those qualities really are in increasing order, but can we encode (rabbit, elephant, eagle) as (-1, 0, 1)? Is the difference between an eagle and a rabbit the same as the difference between an eagle and an elephant? Well, no! Also, even where the encoding is sensible, a straight line is a poor choice for predicting probabilities: its output is unbounded, so it can happily predict values below 0 or above 1, and outlying points can drag the fit far away from a sensible decision boundary.
Logistic Regression
For logistic regression we use a sigmoid function, which looks something like this:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Here "gradient" refers to the slope of this S-shaped curve. Notice how for all real x the output lies strictly between 0 and 1.
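To see this squashing in action, here is a minimal Python sketch (NumPy is my choice here for illustration, not something the derivation requires):

```python
import numpy as np

def sigmoid(x):
    """Map any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Even extreme inputs stay strictly between 0 and 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```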
Now let's get to the math. The word "logistic" refers to "logarithm + odds (chances)", i.e. the model is built around the log of the odds:
odds of an event = probability of the event occurring / (1 - probability of the event occurring)
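As a quick worked example (the numbers are mine, purely illustrative):

```python
import math

p = 0.8                    # probability the event occurs
odds = p / (1 - p)         # 0.8 / 0.2 = 4.0, i.e. "4 to 1" in favour
log_odds = math.log(odds)  # ~1.386, the log-odds ("logit") that logistic regression models
```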
So in logistic regression we try to find the probability of belonging to a certain class, given an input instance X. We write the conditional probability P(Y = 1 | X) = p(X), where "1" is not a number but a class label. The odds can then be written as p(X) / (1 - p(X)). Okay, but what do we learn? In linear regression we were looking for the best-fit line, and the parameters we were optimising were (m, c), slope and intercept to be precise. What's the parameter here?
What Are the Parameters?
We introduce a parameter beta into the sigmoid function. This beta decides two things:
- at what value of X the output is 0.5
- how steep the slope of the sigmoid is. For beta tending to infinity, the sigmoid turns into a step function (yes/no). So this beta is what we need to optimise according to our training data set; the little sketch below makes both effects visible.
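Here is a small sketch of both effects, with hypothetical parameter names beta0 (intercept, shifts where the output crosses 0.5) and beta1 (slope, controls steepness); the derivation below works with a single vector beta instead:

```python
import numpy as np

def sigmoid_beta(x, beta0, beta1):
    """Sigmoid with an intercept and a slope parameter."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(sigmoid_beta(x, beta0=0.0, beta1=1.0))    # gentle S-curve, crosses 0.5 at x = 0
print(sigmoid_beta(x, beta0=0.0, beta1=100.0))  # nearly a step function (yes/no)
print(sigmoid_beta(x, beta0=2.0, beta1=1.0))    # the 0.5 crossing shifts to x = -2
```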
$$p(X) = \frac{1}{1 + e^{-\beta^{T}X}} \quad\Longleftrightarrow\quad \log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta^{T}X$$

This is the function with the learnable parameter beta, and its linear relation with the "log odds".
Again we need to decide on our loss function! We use beta-hat to represent an estimated beta. Logistic regression uses the concept of maximum likelihood to optimise beta-hat: the objective is the product of the probabilities p(x) over all x in class 1, multiplied by the product of (1 - p(x)) over all x in class 0, and we want this product to be as large as possible. In simple terms, this approach pushes p(x) towards 1 for instances with y = 1 and towards 0 for instances with y = 0.
$$L(\beta) = \prod_{i:\,y_i=1} p(x_i) \prod_{i:\,y_i=0} \bigl(1 - p(x_i)\bigr) = \prod_{i} p(x_i)^{y_i}\bigl(1 - p(x_i)\bigr)^{1 - y_i}$$

$$\ell(\beta) = \sum_{i} \Bigl[\, y_i \log p(x_i) + (1 - y_i)\log\bigl(1 - p(x_i)\bigr) \Bigr]$$

This is our likelihood function, which we want to maximize; the second line takes the log of the compact product form.
Simplifying the above equation, we arrive at the following:
$$\ell(\beta) = \sum_{i} \Bigl[\, y_i\,\beta^{T}x_i - \log\bigl(1 + e^{\beta^{T}x_i}\bigr) \Bigr]$$
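As a sanity check, this simplified form is easy to compute directly; a minimal sketch (assuming the feature matrix X carries a leading column of ones for the intercept, and without any numerical hardening):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """sum_i [ y_i * beta^T x_i - log(1 + e^{beta^T x_i}) ].
    Note: np.log1p(np.exp(z)) can overflow for very large z."""
    z = X @ beta  # beta^T x_i for every row x_i
    return np.sum(y * z - np.log1p(np.exp(z)))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])  # leading column = intercept
y = np.array([1.0, 0.0, 1.0])
print(log_likelihood(np.zeros(2), X, y))  # beta = 0 gives 3 * log(0.5) ~= -2.079
```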
Setting the gradient of this log-likelihood to zero to maximise it gives equations containing logs and exponents that cannot be solved in closed form; such equations are known as transcendental equations. But we can find approximate ways of solving them!
The Newton-Raphson Method (the Approximation)
Here we use the Taylor series expansion of the log-likelihood function that we have derived, ignoring the insignificant higher-order terms (this truncation is part of our working assumptions). Then we keep iterating and updating beta until its value converges and further updates no longer affect it. This updating of beta uses two ingredients: the gradient and the Hessian matrix. If you are not comfortable with vector calculus you can skip this section; in simple words, we find beta using this approach and we have our trained sigmoid function. Getting back, this is what the gradient and the Hessian look like:
$$\nabla \ell(\beta) = X^{T}(y - p), \qquad H = \nabla^{2}\ell(\beta) = -X^{T}WX$$

These are the gradient and the Hessian in their matrix representations, where W is the diagonal matrix with entries p(x_i)(1 - p(x_i)).
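In code, these two quantities are a few lines of NumPy; again a sketch under the same assumptions as before (X with an intercept column, labels y in {0, 1}):

```python
import numpy as np

def gradient_and_hessian(beta, X, y):
    """Gradient X^T (y - p) and Hessian -X^T W X of the log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # p(x_i) for every training row
    W = np.diag(p * (1.0 - p))             # the diagonal matrix W from above
    grad = X.T @ (y - p)
    hess = -X.T @ W @ X
    return grad, hess
```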
$$\beta^{(t+1)} = \beta^{(t)} - H^{-1}\,\nabla \ell\bigl(\beta^{(t)}\bigr) = \beta^{(t)} + \bigl(X^{T}WX\bigr)^{-1}X^{T}(y - p)$$
Using the gradient and the Hessian we iterate over t until beta converges, and hence we get our trained sigmoid function!
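Putting it all together, here is a minimal Newton-Raphson training loop (a sketch, not production code: it assumes X has full column rank, and near-separable data can make beta blow up):

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Iterate beta <- beta - H^{-1} * gradient until beta converges."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = np.diag(p * (1.0 - p))
        grad = X.T @ (y - p)
        hess = -X.T @ W @ X
        step = np.linalg.solve(hess, grad)  # solves H * step = gradient
        beta = beta - step
        if np.max(np.abs(step)) < tol:      # further updates no longer matter
            break
    return beta

# Toy usage on synthetic data (illustrative only):
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])    # intercept + one feature
y = (X[:, 1] + 0.3 * rng.normal(size=100) > 0).astype(float)
print(fit_logistic_newton(X, y))  # learned (intercept, slope)
```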
Happy classifying!