The Problems Faced During Backpropagation While Training Neural Networks
Backpropagation refers to the method used for optimising the weights and biases of a neural network. It uses partial derivatives (gradients) to update the weights after every forward pass. The procedure is known as the gradient descent algorithm, because each weight is updated using the gradient of the error at the current state. The update rule is:
W_ij ← W_ij − α · ∂E/∂W_ij    (E is the error of the network)
W_ij is the jth weight of the ith neuron in the given layer
α (alpha) is the learning rate
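As a minimal sketch of this update rule (my own NumPy illustration with hypothetical shapes and a placeholder gradient rather than one computed by a real backward pass):

```python
import numpy as np

# W[i, j] is the jth weight of the ith neuron, alpha is the learning rate,
# and grad_W stands in for dE/dW_ij as computed during backpropagation.
alpha = 0.01
W = np.random.randn(4, 3)
grad_W = np.random.randn(4, 3)

# Gradient descent update: step against the gradient, scaled by the learning rate.
W = W - alpha * grad_W
```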
Before seeing what problems this simple approach has, and how to fix them, read this blog on activation functions. You can proceed if you are already aware of them, or have a quick look if you want to brush up on the concepts.
The Problems
This simple approach relies on two things:
- Gradient: the gradient measures the slope of the error surface at a given point, in the direction of steepest change.
- Learning rate: the learning rate specifies how fast the model learns, i.e. how big a step it takes per iteration while traversing the error surface.
The learning rate is a simple parameter that decides how large a step the model takes towards the point of convergence after every cycle. The problem with a small learning rate is that, if the gradient is not large enough, you may waste thousands of iterations crawling across that portion of the error surface. If the learning rate is too high, you may overshoot the desired point.
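To make the trade-off concrete, here is a toy illustration (not from the original post) that minimises f(w) = w², whose gradient is 2w, with different learning rates:

```python
def minimise_quadratic(alpha, steps=50, w=5.0):
    """Gradient descent on f(w) = w^2; the gradient at w is 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(minimise_quadratic(alpha=0.001))  # ~4.52: tiny steps, still far from the minimum at 0
print(minimise_quadratic(alpha=0.1))    # ~7e-5: a reasonable step size converges quickly
print(minimise_quadratic(alpha=1.1))    # ~4.6e4: step too large, every update overshoots and blows up
```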
Secondly, suppose the gradients of the loss function at the layers closer to the output layer turn out to be less than one. Then, applying the chain rule as we perform backpropagation, the gradients with respect to the weights keep shrinking as we move towards the input layer. If our network consists of 50 hidden layers and the gradient near the output layer is less than one, the gradients at layers 1-5 will be very close to zero.
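As a rough numerical illustration (assuming, for simplicity, that each layer contributes a factor of at most 0.25 to the chain rule, which is the maximum derivative of the sigmoid activation), the gradient that reaches the early layers of a 50-layer network is practically zero:

```python
# Each backward step through a sigmoid layer multiplies the gradient by at most 0.25
# (the maximum value of the sigmoid's derivative). Chaining ~45 such factors from the
# output layer back to layers 1-5 shrinks the gradient towards zero.
per_layer_factor = 0.25
gradient_at_output = 1.0

gradient_at_early_layer = gradient_at_output * per_layer_factor ** 45
print(gradient_at_early_layer)  # ~8e-28, effectively zero
```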
What would this lead to?
The Problem of Vanishing Gradients
Since, in the algorithm, the weight update depends on the gradient multiplied by the learning rate, gradients that are very close to zero mean that little to no update occurs, and the first few layers effectively stop "learning". All of this happens because the gradients "vanish" as the network grows deeper, hence the name "vanishing gradients". Below are two neural networks, one containing a few hidden layers and one containing many.

A small neural network consisting of just 2 hidden layers.

A long neural network consisting of many hidden layers.
The Solution
The learning rate problem can be addressed by starting with a high learning rate and decreasing it as the number of iterations grows.
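A simple way to do this (one common schedule, sketched here as an assumption rather than the post's exact method) is to let the learning rate decay with the iteration count:

```python
initial_alpha = 0.5   # large steps at the start of training
decay_rate = 0.01

def decayed_learning_rate(iteration):
    """Shrink the learning rate as training progresses (inverse-time decay)."""
    return initial_alpha / (1.0 + decay_rate * iteration)

print(decayed_learning_rate(0))     # 0.5    -> big steps early on
print(decayed_learning_rate(1000))  # ~0.045 -> small, careful steps near convergence
```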
The vanishing gradient problem can be mitigated by using the ReLU activation function, whose derivative is either zero or one, so the chained gradients do not keep shrinking. Moreover, the problem of getting stuck in regions of low gradient, which wastes a lot of iterations, can be addressed by momentum-based gradient descent, which incorporates the history of the training process. Optimisers such as Adam and Adagrad build on these ideas; you can read about them in this post.
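Below is a minimal sketch of both ideas (my own NumPy illustration with placeholder values): the ReLU derivative is exactly 0 or 1, so chaining it through many layers does not shrink the gradient, and classical momentum keeps a running velocity that accumulates the history of past gradients (Adam builds on this with adaptive per-parameter scaling):

```python
import numpy as np

def relu_derivative(z):
    # Derivative of ReLU: 1 where the pre-activation is positive, 0 elsewhere,
    # so repeated multiplication along the chain rule does not keep shrinking the gradient.
    return (z > 0).astype(float)

# Classical momentum update (placeholder gradient instead of a real backward pass).
alpha, beta = 0.01, 0.9
W = np.random.randn(4, 3)
velocity = np.zeros_like(W)

grad_W = np.random.randn(4, 3)       # gradient from the current iteration
velocity = beta * velocity + grad_W  # accumulate the history of gradients
W = W - alpha * velocity             # step using the accumulated direction
```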
Convolutional neural networks and, especially, recurrent neural networks also face the problem of vanishing gradients. Architectures like LSTMs and GRUs come to the rescue for recurrent networks.