If you know the basic architecture of an encoder-decoder model, you will recognize the picture below:

where "c" is the context vector.
In short, this is how it works:
The encoder compresses all the information of the input sentence into one vector (c), which is passed to the decoder,
which uses it to decode the output. Pretty simple!
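To make that flow concrete, here is a minimal numpy sketch (all the names, sizes and random weights here are my own illustrative assumptions, untrained and not from any particular library): the encoder rolls over the input tokens, its final hidden state becomes the context vector c, and the decoder has to unroll from c alone.

```python
import numpy as np

rng = np.random.default_rng(0)
emb, hid = 8, 16                       # embedding size, hidden size (made up)

# encoder parameters
Wxh = rng.normal(0, 0.1, (hid, emb))   # input -> hidden
Whh = rng.normal(0, 0.1, (hid, hid))   # hidden -> hidden

def encode(inputs):
    """Compress the whole input sentence into ONE context vector c."""
    h = np.zeros(hid)
    for x in inputs:                   # one step per input token
        h = np.tanh(Wxh @ x + Whh @ h)
    return h                           # this final h is the context vector c

# decoder parameters
Uhh = rng.normal(0, 0.1, (hid, hid))   # hidden -> hidden
Why = rng.normal(0, 0.1, (emb, hid))   # hidden -> output

def decode(c, steps):
    """Unroll the decoder from the context vector c alone."""
    h, outputs = c, []
    for _ in range(steps):
        h = np.tanh(Uhh @ h)
        outputs.append(Why @ h)
    return outputs

sentence = [rng.normal(size=emb) for _ in range(5)]  # 5 fake word embeddings
c = encode(sentence)
translation = decode(c, steps=5)
print(len(translation), translation[0].shape)        # 5 outputs of size (8,)
```

Notice the bottleneck: however long the sentence is, everything the decoder ever sees is that one vector c.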
Now let's try to analyze how this stuff works with an analogy. We build the intuition first, then we can jump into the mathematics.
Suppose we are playing a game. There are three children, namely Adam, Daniel and Vesper (I hope you got the James Bond reference! LOL). The game is that Daniel tells a story to Adam, who in turn has to explain the same story to Vesper.
But there is a condition! Adam has a fixed amount of time, let's say T1, allotted for every story.
Now suppose that T1 = 2 minutes.
The first story that Daniel tells is about how his weekend was. Adam could easily explain the summary to Vesper in 2 minutes. It was easy for him. Next, Daniel told him the story of how his last month was. Adam somehow still managed.
Now you see where the trouble begins. Suppose Daniel tells a story that is a summary of the last 2 years of his life. Could Adam ever do it justice in 2 minutes? Never!
Daniel is the encoder, Adam is our context vector, and Vesper is our decoder. She tries to figure out what Daniel exactly meant from just the summary that "the context vector" friend provided. You can see the problem long stories can lead to. This is one of the most basic problems faced by a simple encoder-decoder model. Mathematically speaking, a simple model like the one above cannot remember long-term relations.
More precisely, the gradients are not able to sustain information over such long ranges. The gradients seem to "vanish".
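You can actually watch this happen in a few lines. Below is a toy numpy sketch (the setup and numbers are my own assumptions, purely for illustration): backpropagating through many steps of h_t = tanh(Whh @ h_{t-1}) multiplies the gradient by the local Jacobian at every step, and its norm collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
hid, T = 16, 50
Whh = rng.normal(0, 0.1, (hid, hid))   # small recurrent weights (assumed)

# forward pass: store every hidden state
hs = [rng.normal(size=hid)]
for _ in range(T):
    hs.append(np.tanh(Whh @ hs[-1]))

# backward pass: push a unit gradient from the last step back to the first;
# each step multiplies by Whh^T @ diag(1 - h_t^2), the Jacobian of tanh(Whh h)
grad = np.ones(hid)
for t in range(T, 0, -1):
    grad = Whh.T @ ((1 - hs[t] ** 2) * grad)
    if t % 10 == 0:
        print(f"step {t}: |grad| = {np.linalg.norm(grad):.2e}")
```

The norm drops by orders of magnitude within a few dozen steps, which is exactly why the first words of a long sentence stop influencing the loss.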
One of the better versions of an encoder-decoder (I'll refer to it as ED from now on, it's a long word, dude) is the "bidirectional model". The core idea is that while translating a sentence we do not necessarily go in one direction. Sometimes the proper translation of a particular part of the sentence may require words that occur later. Have a look:

As you can see, in contrast to what was happening earlier, we "move" in both directions. Let me make a point very clear: when we say "move" or draw any network like the first diagram, there are not multiple RNNs. What you see is the time-axis representation of the forward prop. Even above, there are only 2, yes only 2 RNNs, but 4 time steps, which is why you see 8 blocks! And the same goes for backpropagation.
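To make the "only 2 RNNs" point concrete, here is a rough numpy sketch (weights and sizes are illustrative assumptions): one RNN reads the sentence left to right, the other reads it right to left, and each time step concatenates its forward and backward states, so every position sees both its past and its future.

```python
import numpy as np

rng = np.random.default_rng(0)
emb, hid = 8, 16
Wf, Uf = rng.normal(0, 0.1, (hid, emb)), rng.normal(0, 0.1, (hid, hid))
Wb, Ub = rng.normal(0, 0.1, (hid, emb)), rng.normal(0, 0.1, (hid, hid))

def run(inputs, W, U):
    """One RNN, unrolled over the time steps of the input."""
    h, states = np.zeros(hid), []
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

sentence = [rng.normal(size=emb) for _ in range(4)]   # 4 time steps
forward  = run(sentence, Wf, Uf)                      # left -> right
backward = run(sentence[::-1], Wb, Ub)[::-1]          # right -> left, realigned

# each position sees its past AND its future: concatenate the two states
annotations = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(len(annotations), annotations[0].shape)         # 4 vectors of size 32
```

Two weight sets, four time steps, eight applications of them: the 8 blocks in the diagram.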
So this approach can make the results a little better.
But as this guy said:

I'm sure you have heard about LSTMs and GRUs (yes yes, fancy words).
What they help with is sustaining important information over longer ranges (hence the name Long Short-Term Memory units), but this post is not about LSTMs (nor about the gates of the LSTM network). The mathematics of the LSTM network seems a bit overwhelming to some, not because there is some wizardly mathematics going on, but rather because of how it imitates human memory. Let's get the intuition of an LSTM.
Let's start with the forget gate and the cell state.
Up for a story? Suppose you have a friend who talks way too much, just wayyy too much. He comes to your home while you are on your laptop, and he starts to speak. He has been speaking for the last half an hour and you didn't care. Suddenly you hear your crush's name pop up (LOL). Now that's something important, right? So your mind takes it as an input, and now every time you hear "she did" or "she said", you try to connect the dots and you pay attention to those particular points. Whatever other blah blah he was saying is lost (the forget gate; mathematically it is a vector which tells you the importance of every feature to be remembered). This example makes it easy to explain all the gates in an LSTM:
when to forget (forget gate), when to remember (input gate), how to carry the info forward (the cell state), and what you have made out of it till now (the output).
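If you'd like to see that story as code before the detailed-math post, here is a rough single-step LSTM sketch (the gate names f, i, o and this particular layout are the usual textbook conventions, and the weights are made up, so treat it as a sketch, not the definitive implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step, mapped onto the story above:
#   f: forget gate -- how much of the old chatter to drop
#   i: input gate  -- how much of the new info to write down
#   c: cell state  -- the running notes you keep about your crush
#   o: output gate -- how much of those notes you act on right now
def lstm_step(x, h_prev, c_prev, W, U, b):
    z = W @ x + U @ h_prev + b            # all four gates in one matmul
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)                        # candidate new information
    c = f * c_prev + i * g                # forget the old, write the new
    h = o * np.tanh(c)                    # expose a filtered view of the cell
    return h, c

rng = np.random.default_rng(0)
emb, hid = 8, 16
W = rng.normal(0, 0.1, (4 * hid, emb))
U = rng.normal(0, 0.1, (4 * hid, hid))
b = np.zeros(4 * hid)

h, c = np.zeros(hid), np.zeros(hid)
for word in [rng.normal(size=emb) for _ in range(5)]:  # 5 fake words
    h, c = lstm_step(word, h, c, W, U, b)
print(h.shape, c.shape)                                # (16,), (16,)
```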
Now go have a look at the math. You learnt how some information was relevant to your crush through "data", right? That is exactly what these little networks do. I'll make a different post for the detailed mathematics.
Next we will consider Transformers.
Till then, have a martini, shaken not stirred.