
PROBLEMS IN ENCODER-DECODER MODELS

If you know the basic architecture of an encoder-decoder model, you will recognize the picture below:

where "c" is the context vector.

In short, this is how it works:

The encoder compresses all the information of the input sentence into one vector (c), which is passed to the decoder, which uses it to decode the output. Pretty simple!
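To make that picture concrete, here is a minimal NumPy sketch of a vanilla RNN encoder-decoder. It is only an illustration under my own made-up names and sizes, not the exact model in the diagram; the point is that the decoder only ever sees the single vector c.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 8          # illustrative sizes, not from the post

# encoder / decoder parameters (randomly initialised for the sketch)
W_enc, U_enc = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))
W_dec, U_dec = rng.normal(size=(d_hid, d_out)), rng.normal(size=(d_hid, d_hid))
V_out = rng.normal(size=(d_out, d_hid))

def encode(inputs):
    """Compress the whole input sequence into a single context vector c."""
    h = np.zeros(d_hid)
    for x in inputs:                    # one step per input token
        h = np.tanh(W_enc @ x + U_enc @ h)
    return h                            # this final state is "c"

def decode(c, steps):
    """Generate `steps` outputs, starting only from the context vector c."""
    h, y, outputs = c, np.zeros(d_out), []
    for _ in range(steps):
        h = np.tanh(W_dec @ y + U_dec @ h)
        y = V_out @ h                   # no softmax here, just the idea
        outputs.append(y)
    return outputs

source = [rng.normal(size=d_in) for _ in range(5)]   # a 5-token "sentence"
c = encode(source)
translation = decode(c, steps=5)
```

However long the input sentence is, everything has to squeeze through that one vector c, which is exactly the problem the story below illustrates.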

Now let's try to analyze how this stuff works with an analogy. We build the intuition first, then we can jump into the mathematics.

Suppose we are playing a game. There are three children, namely Adam, Daniel and Vesper (I hope you got the James Bond reference! LOL). The game is that Daniel tells a story to Adam, who in turn has to explain the same story to Vesper.

But there is a condition! Adam has a fixed amount of time, let's say T1, allotted for every story.

Now suppose that T1 = 2 minutes.

The first story that Daniel tells is about how his weekend was. Adam could easily explain the summary to Vesper in 2 minutes; it was easy for him. Next, Daniel told him the story of how his last month was. Adam somehow still managed.

Now you see where the trouble begins. Suppose Daniel tells a story that is a summary of his last 2 years of life. Could Adam ever do it justice in 2 minutes? Never!

Daniel is the encoder, Adam is our context vector, and Vesper is our decoder. She tries to figure out what Daniel exactly meant from just the summary that her "context vector" friend provided. You can see the problem that long "stories" can lead to. This is one of the most basic problems faced by a simple encoder-decoder model. Mathematically speaking, a simple model like the one above cannot remember long-term relations.

More precisely, the gradients are not able to sustain information over such long ranges; the gradients seem to "vanish".

One of the better versions of an encoder-decoder (I'll refer to it as ED from now on, it's a long word, dude) is the "bidirectional model". The core idea is that while translating a sentence we do not necessarily go in one direction. Sometimes the proper translation of a particular part of the sentence may require words that occur later. Have a look:

As you can see, in contrast to what was happening earlier, we "move" in both directions. Let me make a point very clear: when we say "move", or draw a network like the first diagram, there are not multiple RNNs. What you see is the time-axis representation of the forward prop. Even above, there are only 2 RNNs, yes only 2, but 4 time steps, and that is why you see 8 blocks! The same goes for backpropagation.
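Here is a rough NumPy sketch of that "only 2 RNNs, 4 time steps" point, with shapes and names I made up for the example: one RNN reads the sentence left to right, the other right to left, and each position's representation is the pair of the two hidden states.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 8, 16

W_f, U_f = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))  # forward RNN
W_b, U_b = rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid))  # backward RNN

def run(inputs, W, U):
    """Unroll one RNN over the sequence and collect its hidden state at each step."""
    h, states = np.zeros(d_hid), []
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

sentence = [rng.normal(size=d_in) for _ in range(4)]   # 4 time steps

fwd = run(sentence, W_f, U_f)                # left -> right
bwd = run(sentence[::-1], W_b, U_b)[::-1]    # right -> left, re-aligned to positions

# 2 RNNs x 4 time steps = the 8 "blocks" in the unrolled diagram;
# each position now sees both past and future context.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```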

So this approach can make the results a little better.

But as this guy said:

I'm sure you've heard about LSTMs and GRUs (yes, yes, fancy words).

The thing they help with is sustaining important information over longer ranges (hence the name Long Short-Term Memory units), but this post is not about LSTMs (nor about the gates of the LSTM network). The mathematics of the LSTM network seems a bit overwhelming to some, not because there is some wizardly mathematics going on, but rather because of how it imitates human memory. Let's get the intuition of an LSTM.

Let's start with the forget gate and the cell state.

Up for a story? Suppose you have a friend who talks way too much, just wayyy too much. He comes to your home while you are on your laptop, and he starts to speak. He has been speaking for the last half an hour and you didn't care. Suddenly you hear your crush's name pop up (LOL). Now that's something important, right? So your mind takes it as an input, and now every time you hear "she did", "she said", you try to connect the dots and pay attention to those particular points. Whatever other blah blah he was saying is lost (the forget gate; mathematically it is a vector which tells you the importance of every feature to be remembered). The above example makes it easy to explain all the gates in one LSTM:

"When to forget, when to remember, how to process the next info (the cell state), and what you have made out till now (the output)."

Now go have a look at the math. You learnt which information was relevant to your crush from "data", right? That is exactly what these little networks do. I'll make a different post for the detailed mathematics.
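If you want a quick peek before that post, here is the textbook single-step LSTM cell written out in NumPy. This is the standard formulation, not code from any particular library, and the sizes are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)

    f = sigmoid(f)            # forget gate: how much of the old cell state to keep
    i = sigmoid(i)            # input gate: how much of the new candidate to write
    o = sigmoid(o)            # output gate: how much of the cell state to expose
    g = np.tanh(g)            # candidate values (the new gossip)

    c = f * c_prev + i * g    # cell state: the long-term memory
    h = o * np.tanh(c)        # hidden state: what you have "made out till now"
    return h, c

# tiny example with illustrative sizes
rng = np.random.default_rng(2)
d_x, d_h = 4, 6
W = rng.normal(size=(4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
```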

Next we will consider Transformers.

Till then, have a martini, shaken not stirred.

NESTEROV ACCELERATED DESCENT

NESTEROV ACCELERATED GRADIENT DESCENT

HOW NESTEROV GRADIENT DESCENT INCREASES EFFICIENCY OF MOMENTUM BASED DESCENTS

The problem that simple momentum-based gradient descent produces is that, because of its "momentum", it overshoots the point of the desired minima and hence leads to lots of oscillations before reaching the desired point.

Below you can see a contour map of an error surface (b is the bias axis, w is the weight axis). The red portion is the area of low gradients and the blue portion contains the desired minima. The red line shows the current point over the iterations:

The iteration starts from the red portion on the top right. Using momentum we reach the minima region fast, but we overshoot due to the momentum, hence a lot of oscillations before reaching the desired minima.

NESTEROV ACCELERATED GRADIENT DESCENT

The solution to the momentum problem near minima regions is obtained by using the Nesterov accelerated weight-updating rule. It is based on the philosophy of "look before you leap". This is how we try to handle the problem:

  1. From momentum descent we know that at any instant the update depends on (the accumulated past) + (the gradient at that point).
  2. So, now what we do is first calculate only the accumulated past gradient, then move, and then find the gradient at that point. If the gradient at that point has changed sign, it means you have crossed the point of minima (zero gradient). This helps to avoid overshooting.

Let's look at the algorithm to make things clear:

update rule for NESTEROV ACCELERATED GRADIENT DESCENT:
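Since the equation image is not reproduced here, this is the standard Nesterov update the caption above refers to, written with gamma as the momentum factor, alpha as the learning rate, and u_t as the accumulated update:

```latex
\begin{aligned}
w_{\text{look ahead}} &= w_t - \gamma\, u_{t-1} \\
u_t &= \gamma\, u_{t-1} + \alpha\, \nabla_w J\big(w_{\text{look ahead}}\big) \\
w_{t+1} &= w_t - u_t
\end{aligned}
\end{aligned}
```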

As we can see, we first update the weight and get the look-ahead weight (w look ahead). Then from that point we calculate the gradients. And if this gradient turns out to be of a different sign, it means we have overshot the point of minima. The plot below makes this clear.

As compared to momentum descent, here we first calculate the w look ahead (4a); since the sign of the gradient changes (from negative to positive), we avoid taking that step.

Let's have a look at how this will look in code, and also compare the Nesterov gradient descent plot with the one we get from simple momentum descent. Below, on the plot (bias vs weights) of the error surface, you can see the Nesterov descent in blue. You can see how drastically the oscillations near the minima region reduce.
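Here is a minimal sketch of both update loops in Python. The quadratic error surface, starting point, and hyperparameters below are made up for illustration; the plots above come from a different setup, so treat this only as a sketch of the two rules.

```python
import numpy as np

# toy error surface over (w, b); purely illustrative
def grad(p):
    w, b = p
    return np.array([2.0 * w, 10.0 * b])         # gradient of J(w, b) = w^2 + 5*b^2

def momentum_descent(p0, alpha=0.05, gamma=0.9, steps=50):
    p, u, path = np.array(p0, dtype=float), np.zeros(2), []
    for _ in range(steps):
        u = gamma * u + alpha * grad(p)           # accumulated past + current gradient
        p = p - u
        path.append(p.copy())
    return path

def nesterov_descent(p0, alpha=0.05, gamma=0.9, steps=50):
    p, u, path = np.array(p0, dtype=float), np.zeros(2), []
    for _ in range(steps):
        look_ahead = p - gamma * u                # "look before you leap"
        u = gamma * u + alpha * grad(look_ahead)  # gradient measured at the look-ahead point
        p = p - u
        path.append(p.copy())
    return path

start = (2.0, 4.0)                                # roughly where the contour plots start
plain = momentum_descent(start)
nest = nesterov_descent(start)
# plotting both paths over the (w, b) contours shows far fewer
# oscillations near the minimum for the Nesterov run
```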

happy traversing!!!


MOMENTUM BASED GRADIENT DESCENT

UNDERSTANDING HOW USING "MOMENTUM" HELPS TO MAKE LEARNING EFFICIENT IN LOW-GRADIENT REGIONS

After learning what gradient descent means, we try to address one of the major problems a simple weight-updating algorithm faces, and later we discuss how we can solve the issue and hence make the learning efficient. But before that, let us try to understand the problem itself!

What were the parameters affecting the update of the weights in the gradient descent algorithm? For any iteration, the update depended on only 2 quantities:

  1. Gradient
  2. Learning rate

THE GRADIENT DESCENT RULE, WHERE ALPHA IS THE LEARNING RATE, J IS THE LOSS FUNCTION, AND Wij IS THE jth WEIGHT OF THE ith NEURON OF THE SPECIFIED LAYER.
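Since the equation image itself is not shown here, the rule that caption describes is the usual gradient descent update:

```latex
w_{ij} \leftarrow w_{ij} - \alpha\, \frac{\partial J}{\partial w_{ij}}
```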

THE PROBLEM

Ever felt like just lying on the bed, dunking your head in the pillow and doing nothing because you "lack motivation"? Well, this is the kind of problem our simple weight-updating algorithm faces! The "motivation" which drives every iteration is the gradient. How big an update will be depends on how large the gradient is. Suppose you enter a region on the error surface where the gradient is too small, almost a flat surface. This would make the product of (learning rate) and (gradient) too small, and the weights won't get updated enough. Hence the next forward iteration would practically produce the same output (since the weights have not changed much), and this would cause a "halt" in the learning process. Your network becomes stagnant and lots of iterations are wasted before you can escape the flattish low-gradient portion of the error surface.

THE SOLUTION: USING "MOMENTUM"

Imagine yourself driving a car and trying to find a location. You start by asking a food-stall owner for directions (analogous to gradients). He directs you towards the east. You slowly start driving towards the east. At the next stall (analogous to the next iteration) you again ask for directions, and yet again he directs you towards the east. Now confident enough, you start driving a little faster. And if a third stall owner also directs you towards the east, you at least become sure that you are on the right path. So multiple sources pointing in the same direction helped you gain "momentum".

How can we use this concept in gradient descent? Following is the new algorithm that incorporates such "momentum":

At any instant t, the weight update not only depends on the current gradient and the learning rate, but also on the history of the iterations, where the iterations of the near past are more important. This means that if multiple iterations are pointing in the same direction, the weight update will be larger. Below we see how to achieve this mathematically:

THE NEW WEIGHT UPDATING ALGORITHM

MOMENTUM BASED GRADIENT DESCENT
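Written out (since the original equation image is not reproduced here), the standard momentum update those headings refer to is:

```latex
\begin{aligned}
u_t &= \gamma\, u_{t-1} + \alpha\, \nabla_w J(w_t) \\
w_{t+1} &= w_t - u_t
\end{aligned}
```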

where gamma lies between 0 and 1.

The above relation shows that the update at iteration t depends on (the current gradient times the learning rate) + (a part of the previous update). Let's see what a few iterations using the above "momentum"-based approach look like:

momentum based gradient descent

ANY UPDATE t IS THE EXPONENTIALLY WEIGHTED AVERAGE OF THE PAST UPDATES.
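Unrolling that recurrence (assuming the very first update u_0 is zero) makes the "exponentially weighted" part explicit:

```latex
u_t = \alpha\,\nabla_w J(w_t) + \gamma\,\alpha\,\nabla_w J(w_{t-1}) + \gamma^{2}\alpha\,\nabla_w J(w_{t-2}) + \cdots
    = \alpha \sum_{k=1}^{t} \gamma^{\,t-k}\, \nabla_w J(w_k)
```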

CONCLUSION

This momentum-based approach reduces the problem of "stagnancy" in low-gradient areas very efficiently. But it has its own slight drawback, a little one. Because we are gaining momentum and hence making larger updates, we might overshoot the desired lowest point, and this would lead to an oscillatory behaviour before we could reach the desired point. Because once you overshoot, you again update the weights to reach the minima of the error surface, and again, due to momentum, you might overshoot (subsequent overshoots would be smaller).

LOSS FUNCTION PLOTTED AGAINST A WEIGHT. NOTICE HOW BIGGER UPDATES ARE MADE FROM STEP 1 UP TO STEP 3, BUT THEN WE OVERSHOOT THE POINT OF THE DESIRED MINIMA.

Let's try to visualise the problem on an error-surface contour map whose axes are weight and bias respectively. In the map below, the area in red stands for low-gradient portions and the area in blue stands for the high-gradient ones. The line in red shows the value of (w, b) after every iteration. The initial weight and bias are set near (2, 4).

The problem of overshooting the minima and oscillating near it can be easily visualised near the blue region.

CAN THIS PROBLEM BE SOLVED?

YES INDEED!!

In the next article we discuss the Nesterov accelerated gradient descent algorithm, which tries to solve the problem raised above.

HAPPY LEARNING!

VANISHING GRADIENTS IN NEURAL NETWORKS

VANISHING GRADIENTS IN NEURAL NETWORKS

THE PROBLEM FACED DURING BACKPROPAGATION WHILE TRAINING NEURAL NETWORKS

Backpropagation refers to the method used for optimising the weights and biases of a neural network model. It uses partial derivatives/gradients to update the weights after every forward cycle. Following is the algorithm used, also known as the gradient descent algorithm, as the […]