# VANISHING GRADIENTS IN NEURAL NETWORKS

## THE PROBLEM FACED DURING BACKPROPOGATION WHILE TRAINING NEURAL NETWORKS

BACKPROPOGATION REFERS TO THE METHOD USED FOR OPTIMISING THE WEIGHTS AND BIASES OF A NEURAL NETWORK LEARNING MODEL .

IT USES PARTIAL DERIVATIVES /GRADIENTS TO UPDATE THE WEIGHTS AFTER EVERY FORWARD CYCLE . FOLLOWING IS THE ALGORITHM USED ,ALSO KNOWN AS GRADIENT DESCENT ALGORITHM AS THE WEIGHTS ARE UPDATED USING THE MAXIMUM GRADIENT PRESENT AT THE PARTICULAR STATE . THE FOLLOWING IS THE ALGORITHM :

BEFORE SEEING WHAT PROBLEMS THE ABOVE SIMPLE APPROACH HAS , AND HOW TO REMOVE THAT READ THIS BLOG ON ACTIVATION FUNCTIONS . YOU CAN PROCEED IF YOU ALREADY ARE AWARE OF THEM . HAVE A QUICK LOOK IF YOU WANT TO BRUSH UP THE CONCEPTS .

## THE PROBLEMS

SO THIS SIMPLE APPROACH USES 2 THINGS :

1. GRADIENTS : THE GRADIENT MEASURES THE MAX SLOPE OF THE ERROR SURFACE AT ANY GIVEN POINT .
2. LEARNING RATE : LEARNING RATE SPECIFIES HOW FAST A MODEL IS LEARNING OR HOW BIG A STEP IT IS TAKING PER ITERATION TRAVERSING THE ERROR SURFACE

LEARNING RATE IS A SIMPLE PARAMETER THAT DECIDES HOW LARGE A MODEL WILL TAKE A STEP TOWARDS THE POINT OF CONVERGENCE AFTER EVERY CYCLE . THE PROBLEM WITH SMALL LEARNING RATE IS THAT IF THE GRADIENT IS NOT BIG ENOUGH YOU MIGHT BE WASTING THOUSANDS OF ITERATIONS NEAR THAT PORTION OF THE ERROR SURFACE. IF THE LEARNING RATE IS TOO HIGHT YOU MIGHT OVERSHOOT FROM THE DESIRED POINT .

SECONDLY ,SUPPOSE IF THE GRADIENTS OF THE LOSS FUNCTION NEAR THE LAYERS THAT ARE CLOSER TO THE OUTPUT LAYER TURNS OUT TO BE LESS THAN ONE . SO USING THE CHAIN RULE AS WE PERFORM BACKPROPOGATION , THE VALUE OF GRADIENTS WITH RESPECT TO THE WEIGHTS WOULD START DIMINISHING AS WE MOVE TOWARDS THE INPUT LAYER . NOW SUPPOSE IF OUR NETWORK CONSISTS OF 50 HIDDEN LAYERS , THEN IF THE GRADIENT NEAR THE OUTPUT LAYER IS LESS THEN ONE , THE GRADIENTS AT THE LAYERS 1-5 WOULD BE VERY CLOSE TO ZERO .

### THE PROBLEM OF VANISHING GRADIENTS

SINCE IN THE ALGORITHM , THE WEIGHT UPDATION DEPENDS ON THE GRADIENT MULTIPLIED BY LEARNING RATE , IT TURNS OUT THAT SINCE GRADIENTS ARE VERY CLOSE TO ZERO , ALMOST VERY LITTLE TO NO UPDATION OCCURS AND THIS LEADS TO NO ‘LEARNING ” OF THE FIRST FEW LAYERS . ALL THIS HAPPENING BECAUSE THE GRADIENTS “VANISHED ” WITH THE INCREASE OF LENGTH OF NEURAL NETWORKS, HENCE THE NAME “VANISHING GRADIENTS ” BELOW ARE TWO NEURAL NETWORKS ,ONE CONTAINS A FEW HIDDEN LAYERS WHILE ONE CONTAINS MANY .

## THE SOLUTION

THE LEARNING RATE PROBLEM CAN BE SOLVED BY INITIALLY KEEPING THE LEARNING RATE HIGH , AND DECREASING IT AS WE REACH HIGHER NUMBER OF ITERATIONS .

THE VANISHING GRADIENT PROBLEM CAN BE SOLVED BY USING RELU ACTIVATION FUNCTION . THIS WILL MAKE THE DERIVATIVES OF THE WEIGHTS EITHER ZERO OR ONE . MOREOVER THE PROBLEM OF GETTING STUCKED IN REGIONS OF LOWER GRADIENTS WHICH WASTES A LOT OF ITERATIONS , CAN BE SOLVED BY USING MOMENTUM BASED GRADIENT DESCENTS WHICH INCORPORATES THE HISTORY OF THE TRAINING PROCESS . SOME OF THE OPTIMISATIONS ARE KNOWN AS ADAM AND ADAGRAD . YOU CAN READ ABOUT THEM IN THIS POST .

CONVOLUTIONAL NEURAL NETWORKS AND RECURRENT NEURAL NETWORKS ALSO FACE THE PROBLEM OF VANISHING GRADIENTS . SOLUTIONS LIKE LSTMs , GRUs COME TO THE RESCUE.