## UNDERSTANDING HOW USING “MOMENTUM” HELPS TO MAKE LEARNING EFFICIENT IN LOW GRADIENT REGIONS

After learning what gradient descent means, we address one of the major problems a simple weight-updating algorithm faces, and then discuss how to solve it and make learning more efficient. But before we try to understand the problem itself, a quick recap!

What were the parameters affecting the update of weights in the gradient descent algorithm? For any iteration, the update depended on only two quantities:

- Gradient
- Learning rate

### The gradient descent rule, where alpha is the learning rate, J is the loss function, and Wij is the jth weight of the ith neuron of the specified layer.
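Written out with the symbols named above, and assuming the standard form of the rule, a single gradient descent step reads:

$$
W_{ij} \leftarrow W_{ij} - \alpha \, \frac{\partial J}{\partial W_{ij}}
$$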

## THE PROBLEM

Ever felt like just lying on the bed, burying your head in the pillow and doing nothing because you “lack motivation”? Well, this is the kind of problem our simple weight-updating algorithm faces! The “motivation” that drives every iteration is the gradient: how big an update is depends on how large the gradient is. Suppose you enter a region of the error surface where the gradient is very small, an almost flat surface. The product of the learning rate and the gradient then becomes tiny, and the weights barely change. The next forward pass therefore produces practically the same output (since the weights have not changed much), and learning comes to a “halt”. Your network stagnates, and many iterations are wasted before it can escape the flat, low-gradient portion of the error surface.
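To make the stall concrete, here is a minimal sketch in Python. The loss `J(w) = 1 - exp(-w**2)` is not from this article; it is an assumed toy function chosen only because it is nearly flat far from its minimum at `w = 0`. The function names and the starting point are likewise hypothetical.

```python
import math

def loss_gradient(w):
    """Gradient of the assumed toy loss J(w) = 1 - exp(-w**2)."""
    return 2.0 * w * math.exp(-w * w)

def gradient_descent(w, learning_rate=0.1, steps=100):
    """Plain gradient descent: every step is learning_rate * gradient."""
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w

w_start = 3.0  # far out on the flat part of the toy loss
w_end = gradient_descent(w_start)

print(f"gradient at start: {loss_gradient(w_start):.6f}")
print(f"w after 100 steps: {w_end:.4f}")
```

Because each step is the learning rate times a gradient of less than one thousandth, a hundred iterations barely move the weight away from where it started.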

## THE SOLUTION, USING “MOMENTUM”

Imagine yourself driving a car and trying to find a location. You start by asking a food stall owner for directions (analogous to the gradient). He directs you towards the east, so you slowly start driving east. At the next stall (analogous to the next iteration) you ask for directions again, and once more you are pointed towards the east. Now confident enough, you start driving a little faster. And if a third stall owner also directs you towards the east, you can at least be sure that you are on the right path. So multiple sources pointing in the same direction helped you gain “momentum”.

How can we use this concept in gradient descent? The following is the new algorithm that incorporates such “momentum”.

At any step t, the weight update depends not only on the current gradient and the learning rate but also on the history of the iterations, where the iterations of the near past matter more. This means that if multiple iterations point in the same direction, the weight update becomes larger. Below we see how to achieve this mathematically:

## THE NEW WEIGHT UPDATING ALGORITHM

#### where gamma lies between 0 and 1.
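In symbols, and assuming the commonly used form of the momentum rule (the shorthand $w_t$ for the weights at step $t$ and $\text{update}_t$ for the step taken are notation introduced here for illustration), the algorithm can be written as:

$$
\begin{aligned}
\text{update}_t &= \gamma \, \text{update}_{t-1} + \alpha \, \nabla_w J(w_t) \\
w_{t+1} &= w_t - \text{update}_t
\end{aligned}
$$

with $\text{update}_0 = 0$. Setting $\gamma = 0$ recovers the plain gradient descent rule.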

The relation above shows that the update at iteration t is the current gradient times the learning rate, plus a part of the previous update. Let's see what a few iterations of this “momentum”-based approach look like:
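As a sketch, assume the update has the form $\text{update}_t = \gamma \, \text{update}_{t-1} + \alpha \, g_t$ with $\text{update}_0 = 0$, where $g_t$ is shorthand (introduced here) for the gradient at step $t$. Unrolling the first few steps gives:

$$
\begin{aligned}
\text{update}_1 &= \alpha \, g_1 \\
\text{update}_2 &= \gamma \, \alpha \, g_1 + \alpha \, g_2 \\
\text{update}_3 &= \gamma^2 \alpha \, g_1 + \gamma \, \alpha \, g_2 + \alpha \, g_3 \\
\text{update}_t &= \alpha \sum_{k=1}^{t} \gamma^{\,t-k} \, g_k
\end{aligned}
$$

Since $\gamma$ lies between 0 and 1, older gradients are multiplied by higher powers of $\gamma$ and therefore count for less.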

### Any update t is an exponentially weighted sum of the current and past gradients.
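The following sketch checks this numerically. It assumes the recursive rule `update = gamma * update + lr * gradient` starting from zero, and compares it with the explicit weighted sum. The function names and the constant gradient values are hypothetical, chosen only for illustration.

```python
def momentum_updates(grads, lr=0.1, gamma=0.9):
    """Recursive form: each update keeps a fraction gamma of the previous one."""
    update = 0.0
    updates = []
    for g in grads:
        update = gamma * update + lr * g
        updates.append(update)
    return updates

def unrolled_update(grads, t, lr=0.1, gamma=0.9):
    """Explicit form: gradient k is weighted by gamma ** (t - k)."""
    return sum(gamma ** (t - k) * lr * grads[k] for k in range(t + 1))

# Four iterations that all "point the same way" with the same strength.
grads = [0.5, 0.5, 0.5, 0.5]
updates = momentum_updates(grads)

for t, u in enumerate(updates):
    print(f"step {t + 1}: recursive={u:.5f}  unrolled={unrolled_update(grads, t):.5f}")
```

With identical gradients at every step, each update is larger than the one before it, which is the “driving a little faster” effect from the car analogy.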

## CONCLUSION

This momentum-based approach reduces the problem of “stagnancy” in low-gradient areas very efficiently, but it has a slight drawback of its own. Because we gain momentum and therefore make larger updates, we might overshoot the desired lowest point, which leads to oscillatory behaviour before we reach it. Once you overshoot, the weights are updated again to head back towards the minimum of the error surface, and momentum may carry you past it once more (subsequent overshoots would be smaller).
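The overshoot can be reproduced with a small sketch. The quadratic loss `J(w) = 0.5 * w**2`, the starting weight, and the hyperparameter values are assumptions made for illustration and do not come from this article. The function `run` is a hypothetical helper.

```python
def run(gamma, learning_rate=0.1, steps=200, w=1.0):
    """Minimise the assumed loss J(w) = 0.5 * w**2, whose gradient is w.

    gamma = 0 gives plain gradient descent; gamma > 0 adds momentum.
    """
    update = 0.0
    path = [w]
    for _ in range(steps):
        update = gamma * update + learning_rate * w
        w = w - update
        path.append(w)
    return path

plain_path = run(gamma=0.0)
momentum_path = run(gamma=0.9)

print("plain GD, lowest w reached:   ", min(plain_path))
print("momentum, lowest w reached:   ", min(momentum_path))
print("momentum, w after all steps:  ", momentum_path[-1])
```

The minimum of this loss is at `w = 0`. Plain gradient descent approaches it from one side only, while the momentum run crosses to negative values, swings back, and settles near zero as the oscillations die down.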

### Loss function plotted against a weight. Notice how bigger updates are made from step 1 up to step 3, but then we overshoot the desired minimum.

Let's try to visualise the problem on a contour map of the error surface whose axes are the weight and the bias, respectively. In the map below, the area in red stands for low-gradient regions and the area in blue for high-gradient ones. The red line shows the value of (w, b) after every iteration. The initial weight and bias are set near (2, 4).

### The problem of overshooting the minimum and oscillating around it can be easily seen near the blue region.

Can this problem be solved?

Yes indeed!

In the next article we discuss the Nesterov momentum gradient descent algorithm, which tries to solve the problem raised above.

Happy learning!