UNDERSTANDING HOW USING “MOMENTUM” HELPS TO MAKE LEARNING EFFICIENT IN LOW GRADIENT REGIONS
After learning what gradient descent means, we now address one of the major problems a simple weight-updating algorithm faces, and later discuss how to solve the issue and hence make learning more efficient. But first, a quick recap before we get to the problem itself.
What were the parameters affecting the update of the weights in the gradient descent algorithm? For any iteration, the update depended on only two quantities:
- Gradient
- Learning rate

The gradient descent rule, where alpha (α) is the learning rate, J is the loss function, and w_ij is the jth weight of the ith neuron of the specified layer.
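The rule itself appears as an image in the original post; in LaTeX notation, the standard update rule the caption describes is:

$$ w_{ij} \leftarrow w_{ij} - \alpha \, \frac{\partial J}{\partial w_{ij}} $$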
THE PROBLEM
Ever felt like just lying on the bed, dunking your head in the pillow and doing nothing because you “lack motivation”? Well, this is the kind of problem our simple weight-updating algorithm faces! The “motivation” which drives every iteration is the gradient. How big an update will be depends on how large the gradient is. Suppose you enter a region on the error surface where the gradient is very small, an almost flat surface. This would make the product of (learning rate) and (gradient) too small, and the weights would not get updated enough. Hence the next forward pass would produce practically the same output (since the weights have barely changed), and this would cause a “halt” in the learning process. Your network becomes stagnant, and many iterations are wasted before you can escape the flat, low-gradient portion of the error surface.
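To make the stagnation concrete, here is a tiny numeric sketch of the plain update rule; the learning rate and gradient value are illustrative assumptions, not taken from the article:

```python
# Plain gradient descent update in an almost flat region of the error surface.
lr = 0.01        # learning rate
grad = 1e-6      # gradient on a nearly flat portion of the error surface
w = 0.5
w_new = w - lr * grad
print(w_new - w)   # about -1e-08: the weight barely changes, so learning stalls
```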
THE SOLUTION: USING “MOMENTUM”
Imagine yourself driving a car and trying to find a location. You start by asking a food-stall owner for directions (analogous to gradients). He directs you towards the east. You slowly start driving east. At the next stall (analogous to the next iteration) you again ask for directions, and yet again he points you towards the east. Now, confident enough, you start driving a little faster. And if a third stall owner also directs you towards the east, you at least become sure that you are on the right path. So multiple sources pointing in the same direction helped you gain “momentum”.
How can we use this concept in gradient descent? The following is the new algorithm that incorporates such “momentum”.
At any instant t, the weight update depends not only on the current gradient and learning rate, but also on the history of past iterations, with the iterations of the near past being more important. This means that if multiple iterations point in the same direction, the weight update will be larger. Below we see how to achieve this mathematically:
THE NEW WEIGHT UPDATING ALGORITHM
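The formula is shown as an image in the original post; in the notation used so far, the standard momentum update it describes is (with v_t denoting the update applied at iteration t, and v_0 = 0):

$$ v_t = \gamma \, v_{t-1} + \alpha \, \frac{\partial J}{\partial w}\bigg|_{w_t}, \qquad w_{t+1} = w_t - v_t $$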

where gamma (γ) lies between 0 and 1.
The above relation shows that the update at iteration t equals the current gradient times the learning rate, plus a part of the previous update. Let us see what a few iterations using this “momentum”-based approach look like:
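Unrolling the recurrence for a few iterations (writing ∇J_k for the gradient at iteration k) gives:

$$ v_t = \alpha \nabla J_t + \gamma \, \alpha \nabla J_{t-1} + \gamma^2 \, \alpha \nabla J_{t-2} + \cdots + \gamma^{t-1} \alpha \nabla J_1 $$

Older gradients are scaled by higher powers of γ and therefore contribute less.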

Any update at iteration t is an exponentially weighted average of the past gradients.
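A minimal NumPy sketch of the momentum update described above; the quadratic loss and the parameter values are illustrative assumptions, not from the original post:

```python
import numpy as np

def momentum_gradient_descent(grad_fn, w_init, lr=0.1, gamma=0.9, n_iters=200):
    """Momentum-based weight updates.

    grad_fn : callable that returns the gradient of the loss at w
    lr      : learning rate (alpha)
    gamma   : momentum coefficient, with 0 < gamma < 1
    """
    w = np.asarray(w_init, dtype=float)
    v = np.zeros_like(w)                   # accumulated history of updates, v_0 = 0
    for _ in range(n_iters):
        v = gamma * v + lr * grad_fn(w)    # v_t = gamma * v_{t-1} + alpha * gradient
        w = w - v                          # w_{t+1} = w_t - v_t
    return w

# Example: minimise the quadratic loss J(w) = (w - 3)^2, whose gradient is 2(w - 3).
grad = lambda w: 2 * (w - 3)
print(momentum_gradient_descent(grad, w_init=[0.0]))   # prints a value close to 3
```

Compared with plain gradient descent (where v would be just lr * grad), the accumulated term gamma * v keeps the update moving even when the current gradient is tiny.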
CONCLUSION
This momentum-based approach very effectively reduces the problem of “stagnancy” in low-gradient areas. But it has its own slight drawback. Because we gain momentum and hence make larger updates, we might overshoot the desired lowest point, which then leads to oscillatory behaviour before we can settle at the desired point. Once you overshoot, you again update the weights to move back towards the minimum of the error surface, and due to momentum you might overshoot again (although subsequent overshoots become smaller).
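As a rough illustration of the overshoot, here is the trajectory on the same toy quadratic loss used in the sketch above (values chosen only for demonstration):

```python
# Track the trajectory on J(w) = (w - 3)^2 with a fairly large momentum term.
w, v, lr, gamma = 0.0, 0.0, 0.1, 0.9
trajectory = []
for _ in range(30):
    v = gamma * v + lr * 2 * (w - 3)   # momentum-based update
    w = w - v
    trajectory.append(round(w, 2))
print(trajectory)   # w shoots past 3, then oscillates around it before settling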

Loss function plotted against a weight. Notice how bigger updates are made from step 1 up to step 3, but then we overshoot the point of the desired minimum.
Let us try to visualise the problem on an error-surface contour map whose axes are the weight and the bias respectively. In the map below, the area in red stands for low-gradient portions and the area in blue stands for high-gradient ones. The line in red shows the value of (w, b) after every iteration. The initial weight and bias are set near (2, 4).

The problem of overshooting the minimum and oscillating near it can be easily visualised near the blue region.
CAN THIS PROBLEM BE SOLVED?
Yes, indeed!
In the next article we discuss the Nesterov momentum gradient descent algorithm, which tries to solve the problem raised above.
Happy learning!