HOW NESTEROV ACCELERATED GRADIENT DESCENT IMPROVES THE EFFICIENCY OF MOMENTUM-BASED DESCENT
THE PROBLEM WITH SIMPLE MOMENTUM-BASED GRADIENT DESCENT IS THAT, BECAUSE OF ITS "MOMENTUM", IT OVERSHOOTS THE DESIRED MINIMUM AND HENCE OSCILLATES A LOT BEFORE SETTLING AT THE DESIRED POINT.
Below you can see a contour map of an error surface (b is the bias axis, w is the weight axis). The red portion is the area of low gradients and the blue portion contains the desired minimum. The red line shows the current point over iterations: the iteration starts from the red portion at the top right. Using momentum we reach the minimum region fast, but we overshoot due to momentum, hence a lot of oscillations before reaching the desired minimum.
NESTEROV ACCELERATED GRADIENT DESCENT
The solution to the momentum problem near minimum regions is obtained by using the Nesterov accelerated gradient (NAG) update rule. It is based on the philosophy of "look before you leap". This is how we try to handle the problem:
- From momentum descent we know that at any instant the update depends on (the accumulated past gradients) + (the gradient at the current point).
- So now what we do is: first move by the accumulated past gradient alone, and only then compute the gradient at that look-ahead point. If the gradient at that point has changed sign, it means we have crossed the point of minimum (zero gradient), and the update is corrected before a large overshoot happens.
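The steps above can be sketched for a single parameter as follows. This is a minimal illustration, not the exact code from this post: the quadratic loss, the hyperparameter values `gamma` and `eta`, and the starting values are all assumptions made for the example.

```python
# One Nesterov-style look-ahead step for a single weight.
# Toy loss f(w) = (w - 2)^2, used only for illustration.

def grad(w):
    return 2.0 * (w - 2.0)          # gradient of (w - 2)^2

gamma, eta = 0.9, 0.1               # momentum and learning rate (assumed values)
w, u_prev = 0.0, 0.5                # current weight and accumulated history

w_lookahead = w - gamma * u_prev    # 1. move by the accumulated past alone
g_lookahead = grad(w_lookahead)     # 2. gradient at the look-ahead point

# 3. a sign flip between the current gradient and the look-ahead gradient
#    signals that the look-ahead jump has crossed the minimum
crossed = grad(w) * g_lookahead < 0

u = gamma * u_prev + eta * g_lookahead   # history updated with look-ahead gradient
w = w - u                                # final weight update
```

Because the gradient is evaluated at the look-ahead point rather than at the current point, the correction kicks in one step earlier than in plain momentum descent.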
Let's look at the algorithm to make things clear:
update rule for NESTEROV ACCELERATED GRADIENT DESCENT:
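In standard notation (with u_t the accumulated history, gamma the momentum parameter, eta the learning rate, and L the loss), the Nesterov update rule is usually written as:

```latex
\begin{aligned}
w_{\text{lookahead}} &= w_t - \gamma\, u_{t-1} \\
u_t &= \gamma\, u_{t-1} + \eta\, \nabla_w L\!\left(w_{\text{lookahead}}\right) \\
w_{t+1} &= w_t - u_t
\end{aligned}
```

The only difference from plain momentum descent is where the gradient is evaluated: at the look-ahead point instead of at the current point.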
As we can see, we first update the weight to get the look-ahead weight (w_lookahead). Then from that point we calculate the gradient. If this gradient turns out to have a different sign, it means we have overshot the point of minimum. The plot below makes this clear.
Compared to momentum descent, here we first calculate the look-ahead weight w_lookahead (4a); since the sign of the gradient changes there (from negative to positive), we avoid taking that step.
Let's have a look at how this looks in code, and also compare the Nesterov gradient descent plot with the one we get from simple momentum descent. Below, on the (bias vs weights) plot of the error surface, you can see the Nesterov descent in blue. You can see how drastically the oscillations near the minimum region are reduced.
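A minimal sketch of what such a comparison might look like. The error surface, starting point, and hyperparameters below are illustrative assumptions, not the ones behind the plot in this post:

```python
# Compare plain momentum descent with Nesterov descent on a toy
# error surface over (w, b): an elongated quadratic bowl that stands
# in for the contour plot discussed above.

def error(w, b):
    return w**2 + 10.0 * b**2

def grad(w, b):
    return 2.0 * w, 20.0 * b

def momentum_descent(steps=100, gamma=0.9, eta=0.02):
    w, b = 4.0, 2.0                 # start in the high-error region
    uw = ub = 0.0                   # accumulated history per parameter
    path = [(w, b)]
    for _ in range(steps):
        gw, gb = grad(w, b)         # gradient at the CURRENT point
        uw = gamma * uw + eta * gw
        ub = gamma * ub + eta * gb
        w, b = w - uw, b - ub
        path.append((w, b))
    return path

def nesterov_descent(steps=100, gamma=0.9, eta=0.02):
    w, b = 4.0, 2.0
    uw = ub = 0.0
    path = [(w, b)]
    for _ in range(steps):
        gw, gb = grad(w - gamma * uw, b - gamma * ub)  # gradient at LOOK-AHEAD point
        uw = gamma * uw + eta * gw
        ub = gamma * ub + eta * gb
        w, b = w - uw, b - ub
        path.append((w, b))
    return path

mom_path = momentum_descent()
nag_path = nesterov_descent()
print("momentum final error:", error(*mom_path[-1]))
print("nesterov final error:", error(*nag_path[-1]))
```

Plotting the two paths on the contours of `error` (e.g. with matplotlib) shows the effect described above: the Nesterov path typically oscillates less around the minimum, because each gradient is evaluated one look-ahead step into the future.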