
MOMENTUM BASED GRADIENT DESCENT

UNDERSTANDING HOW USING “MOMENTUM” HELPS TO MAKE LEARNING EFFICIENT IN LOW-GRADIENT REGIONS

After learning what gradient descent means, we address one of the major problems a simple weight-updating algorithm faces, and then discuss how to solve it and thereby make learning efficient. But first, let us understand the problem itself.

What were the parameters affecting the update of weights in the gradient descent algorithm? For any iteration, the update depended on only 2 quantities:

  1. the gradient
  2. the learning rate

The gradient descent rule, where alpha is the learning rate, J is the loss function, and w_ij is the jth weight of the ith neuron of the specified layer.
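
The equation itself appeared as an image; written out from the caption above, it presumably takes the standard gradient descent form

w_{ij} \leftarrow w_{ij} - \alpha \, \frac{\partial J}{\partial w_{ij}}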

THE PROBLEM

Ever felt like just lying on the bed, dunking your head in the pillow and doing nothing because you “lack motivation”? Well, this is the kind of problem our simple weight-updating algorithm faces! The “motivation” which drives every iteration is the gradient: how big an update is depends on how large the gradient is. Suppose you enter a region of the error surface where the gradient is very small, almost a flat surface. This makes the product of the learning rate and the gradient tiny, and the weights don't get updated enough; for instance, with a learning rate of 0.1 and a gradient of the order of 10⁻⁴, each update changes the weight by only about 10⁻⁵. The next forward pass therefore produces practically the same output (since the weights have barely changed), and this causes a “halt” in the learning process. Your network becomes stagnant, and lots of iterations are wasted before you can escape the flattish, low-gradient portion of the error surface.

THE SOLUTION: USING “MOMENTUM”

Imagine yourself driving a car and trying to find a location. You start by asking a food-stall owner for directions (analogous to gradients). He directs you towards the east, and you slowly start driving east. At the next stall (analogous to the next iteration) you again ask for directions, and yet again you are directed towards the east. Now confident enough, you start driving a little faster. And if a third stall owner also directs you towards the east, you become at least reasonably sure that you are on the right path. So multiple sources pointing in the same direction helped you gain “momentum”.

How can we use this concept in gradient descent? Following is the new algorithm that incorporates such “momentum”.

At any instant t, the weight update depends not only on the current gradient and the learning rate, but also on the history of past iterations, with the iterations of the near past counting more. This means that if multiple iterations point in the same direction, the weight update is larger. Below we see how to achieve this mathematically:

THE NEW WEIGHT UPDATING ALGORITHM

The momentum-based gradient descent update rule, where gamma lies between 0 and 1.
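
The formula was shown as an image; consistent with the description in the next paragraph, it presumably reads (writing the update at iteration t as v_t, with gamma the momentum coefficient and alpha the learning rate)

v_t = \gamma \, v_{t-1} + \alpha \, \nabla_{w} J(w_t)
w_{t+1} = w_t - v_t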

The above relation shows that the update at iteration t equals the current gradient times the learning rate, plus a fraction of the previous update. Let's see what a few iterations using this “momentum”-based approach look like:


Any update at iteration t is an exponentially weighted sum of all the past gradients.
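
A minimal sketch of the idea (the toy loss, learning rate and momentum coefficient below are illustrative choices, not taken from the article): vanilla gradient descent crawls across an almost flat region, while the momentum version accumulates velocity and crosses it.

# Toy 1-D comparison: vanilla vs. momentum-based gradient descent on a loss
# that is almost flat until w reaches 3 and then becomes a bowl with minimum at w = 5.

def grad(w):
    if w < 3.0:
        return -0.01              # plateau: tiny gradient, vanilla GD barely moves
    return 2.0 * (w - 5.0)        # quadratic bowl with its minimum at w = 5

def vanilla_gd(w=0.0, lr=0.1, steps=2000):
    for _ in range(steps):
        w -= lr * grad(w)         # update depends only on the current gradient
    return w

def momentum_gd(w=0.0, lr=0.1, gamma=0.9, steps=2000):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(w)   # exponentially weighted history of gradients
        w -= v
    return w

print("vanilla :", round(vanilla_gd(), 3))    # still stuck on the plateau (~2.0)
print("momentum:", round(momentum_gd(), 3))   # has crossed it and settled near 5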

CONCLUSION

This momentum-based approach reduces the problem of “stagnancy” in low-gradient areas very efficiently, but it has its own slight drawback. Because we are gaining momentum and hence making larger updates, we might overshoot the desired lowest point, which leads to oscillatory behaviour before we finally reach it. Once you overshoot, you again update the weights to move back towards the minimum of the error surface, and again, due to momentum, you might overshoot (subsequent overshoots get smaller and smaller).

Loss function plotted against a weight. Notice how bigger updates are made from step 1 up to step 3, but then we overshoot the point of the desired minimum.

Let's try to visualise the problem on an error-surface contour map whose axes are the weight and the bias. In the map below, the areas in red stand for low-gradient portions and the areas in blue for high-gradient ones. The line in red shows the value of (w, b) after every iteration; the initial weight and bias are set near (2, 4).

The problem of overshooting the minimum and oscillating near it can be easily visualised near the blue region.
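
The same overshoot can be reproduced with a few lines of toy code (a quadratic loss and round numbers chosen purely for illustration): the weight shoots past the minimum, swings back, overshoots again with a smaller amplitude, and so on.

# Momentum overshooting the minimum of the toy loss J(w) = (w - 5)**2
w, v, lr, gamma = 0.0, 0.0, 0.1, 0.9
for step in range(1, 16):
    g = 2.0 * (w - 5.0)          # gradient of (w - 5)**2
    v = gamma * v + lr * g       # momentum-based update
    w = w - v
    print(step, round(w, 2))     # w climbs past 5, swings back below it, then turns again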

CAN THIS PROBLEM BE SOLVED?

YES INDEED!!

In the next article we discuss the Nesterov momentum gradient descent algorithm, which tries to solve the problem raised above.

HAPPY LEARNING!


WHAT IS BACKPROPAGATION IN NEURAL NETWORKS?

HOW DOES A NEURAL NETWORK MODEL USE BACKPROPAGATION TO TRAIN ITSELF TO GET CLOSER TO THE DESIRED OUTPUT?

If you know supervised machine learning, you must have done feature engineering: selecting which features are more important and which can be ignored. Suppose you wanted to create a supervised machine learning model using linear regression for predicting the price of a given house. As a human you would be sure that the colour of the house won't affect the price as much as the number of rooms or the floor area will. This is called feature selection / engineering, so while creating such a model you wouldn't choose “colour” as a deciding factor. Things are different if you train a neural network for the very same purpose: here the neural network itself decides how important a certain input is in predicting the output. But how? Let's see the mathematics involved, and then we will head towards backpropagation.

THE ROLE OF DERIVATIVES

Suppose you have 2 features, X and Y, and the dependent variable is Z. From the information about them we know that a one-unit change in X produces a change of 2 percent in Z, while a one-unit variation in Y produces a variation of 30 percent in Z. So of the features (X, Y), which is more important? It is clearly Y. How did you come up with the answer? In layman's terms, by looking at the “change in output with respect to a feature”; mathematically, at the partial derivative of the output with respect to the feature. And this demands that all features passed to a neural net are just numbers, so features like “good” / “bad” won't do. In case you have features that are not numeric, here is the way to handle them:

  1. Suppose you have a feature which comprises three possible choices:
  2. bad, good and excellent.
  3. You associate a number with each choice, say 0 stands for bad, 1 for good and 2 for excellent.
  4. Using pandas you can replace the string values with the corresponding numbers; you can either modify the existing column itself, or create a new column for the same (a minimal sketch follows below).
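
A minimal sketch of that last step, assuming a hypothetical column named "quality" (the column and data-frame contents here are made up for illustration):

import pandas as pd

# Toy data frame with a non-numeric feature
df = pd.DataFrame({"quality": ["good", "bad", "excellent", "good"]})

# Replace the strings with the corresponding numbers in the existing column...
df["quality"] = df["quality"].map({"bad": 0, "good": 1, "excellent": 2})

# ...or, alternatively, keep the original column and add a new numeric one:
# df["quality_code"] = df["quality"].map({"bad": 0, "good": 1, "excellent": 2})

print(df)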

THE LOSS FUNCTION

The purpose of backpropagation is to make the model better. So what is a measure of how bad a model is? How do we decide how to make things better? This is done using a loss function, which depends on what kind of problem you are dealing with and gives you a numeric value of how “far” you are from the desired output. Let's take the example of a simple linear regression problem. The aim is to find a line that best fits the data points; since a line cannot pass through every data point, there must be some quantity to optimise. What is the perfect line? One approach is:

The line whose sum of distances from all the points is minimum is the best-fit line. Basically, out of all possible lines y = mx + c, you find the values of (m, c) for which the sum of distances is minimised.

This can be summarised mathematically as follows:

For a data set with m points, the RMS (root mean square) loss function is:
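
The formula itself appeared as an image; a standard way to write it, using w and b for the slope and intercept so as not to clash with m, the number of points, is

J(w, b) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( y_i - (w x_i + b) \right)^2 }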

THE WEIGHT UPDATING RULE

Now you are equipped with the knowledge of loss functions and derivatives, so let's discuss what we are looking to optimise in a neural network. Let's start by defining one:

Suppose there are n input-layer features, k hidden layers with m neurons each, and an output layer consisting of 2 neurons. How do we begin the training process? Following are the steps (a minimal sketch of them comes after the list):

  1. Initialise random weights: assign random numbers as the weights and biases for all the hidden layers.
  2. Perform a forward propagation: using the weights and biases, simply calculate an output.
  3. Calculate the loss function. This is what we need to optimise.
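
Here is a minimal NumPy sketch of those three steps for a toy network; the layer sizes, the sigmoid activation and the squared-error loss are illustrative choices, not something fixed by the article.

import numpy as np

rng = np.random.default_rng(0)

# Step 1: initialise random weights and biases (one hidden layer, output layer of 2 neurons).
n_in, n_hidden, n_out = 4, 8, 2
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 2: forward propagation - simply compute an output from the current weights.
x = rng.normal(size=n_in)          # one training example
y_true = np.array([1.0, 0.0])      # its desired output
h = sigmoid(W1 @ x + b1)
y_pred = sigmoid(W2 @ h + b2)

# Step 3: calculate the loss - this is the quantity backpropagation will optimise.
loss = np.mean((y_pred - y_true) ** 2)
print(loss)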

Now we are ready to perform backpropagation.

“Backpropagation refers to updating the weights slightly, depending on how far the model is from the desired output and how long the training algorithm has been running.”

So we update weights. Now, for a big network there are millions of weights, and how does the network know how a certain weight out of those millions affects the output? You guessed it: derivatives, and in particular partial derivatives. We express the derivative of the loss with respect to a weight as a chain rule running back from the output towards the input weights. Let's see how one of these millions of weights is updated.
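
Schematically, for a network with one hidden layer h and prediction \hat{y}, the chain rule for a first-layer weight unrolls as something like

\frac{\partial J}{\partial w^{(1)}_{ij}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_i} \cdot \frac{\partial h_i}{\partial w^{(1)}_{ij}}

where h_i is the output of the ith hidden neuron fed by the weight w^{(1)}_{ij}.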

BACKPROPAGATION AND WEIGHT UPDATING
w^(layer) is the weight-matrix element of a specific layer
i refers to the ith neuron of that layer
j refers to the jth weight from the ith neuron
alpha is the learning rate
J is the loss function
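
Written out with that notation, the update rule presumably shown in the figure takes the standard form

w^{(\text{layer})}_{ij} \leftarrow w^{(\text{layer})}_{ij} - \alpha \, \frac{\partial J}{\partial w^{(\text{layer})}_{ij}}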

So the weight-update rule is as follows (a minimal code sketch of the loop comes after the steps):

1) Calculate the gradient of the loss function with respect to that weight and bias.

2) Decide a learning rate. This rate might stay constant or vary with time; the learning rate suggests how much the model learns, or updates its weights, in a single backpropagation cycle.

3) Update the weights, all of them or in batches. Perform forward propagation and compute the loss function again.

4) Repeat from step 1 until the gradient approaches 0, that is, until there are no more weight updates.
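
A minimal sketch of this loop for a single weight on the toy loss J(w) = (w - 3)**2 (the loss, learning rate and stopping tolerance are illustrative choices):

w, alpha = 0.0, 0.1                # step 2: a constant learning rate

for step in range(1000):
    grad = 2.0 * (w - 3.0)         # step 1: gradient of the loss w.r.t. the weight
    w = w - alpha * grad           # step 3: update the weight, then re-evaluate
    if abs(grad) < 1e-6:           # step 4: repeat until the gradient is (almost) 0
        break

print(step, w)                     # w has converged to ~3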

OPTIMISING BACKPROPAGATION

This is what backpropagation means: going back, asking each and every weight how much it matters in deciding the output, updating the weights, and repeating. There are various methods for optimising this step. The major problems faced are vanishing gradients and exploding gradients; such problems get in our way when the network is deep, i.e. contains many hidden layers. The simple backpropagation algorithm that uses plain gradient descent therefore needs to be optimised further, and algorithms like Adam and Adagrad make this possible. These algorithms help adjust the learning rate as the number of iterations increases. Adam and Adagrad help in the following ways (a rough sketch of an Adam-style step follows the list):

  1. Keeping track of the past learning helps to optimise the learning.
  2. You don't get stuck in areas where the gradient is too small to make progress, and Adam takes care that, while using momentum, you don't overshoot the lowest point too badly.
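
As a rough illustration of the idea, here is a bare-bones sketch of an Adam-style update for a single parameter, with the usual default constants; this is a sketch of the mechanism, not a library implementation.

import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # momentum-like average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) # per-parameter adapted step size
    return w, m, v

# Toy usage on the loss J(w) = (w - 3)**2
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t)
print(round(w, 3))    # approaches 3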