# Problems in Encoder-Decoder Models

If you know the basic architecture of an encoder-decoder model, you will recognize the picture below:

where "c" is the context vector.

In short, this is how it works:

The encoder compresses all the information of the input sentence into one vector (c), which is passed to the decoder, which uses it to decode the output. Pretty simple!
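To make that bottleneck concrete, here is a minimal NumPy sketch of a vanilla encoder-decoder. The encoder's loop ends with a single hidden state c, and that one vector is everything the decoder ever sees. All the dimensions and weights here are made up for illustration; a real model would use learned, trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 3 input steps, 2 output steps,
# input size 6, hidden size 4, output size 5.
T_in, T_out, d_in, d_h, d_out = 3, 2, 6, 4, 5

# Random weights stand in for trained parameters.
W_xh = rng.normal(size=(d_h, d_in)) * 0.1   # input  -> hidden
W_hh = rng.normal(size=(d_h, d_h)) * 0.1    # hidden -> hidden
W_hy = rng.normal(size=(d_out, d_h)) * 0.1  # hidden -> output

def encode(xs):
    """Run a vanilla RNN over the input, keeping only the LAST hidden state."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h  # this single vector is the context c

def decode(c, steps):
    """Unroll the decoder starting from the context vector alone."""
    h, ys = c, []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)   # the decoder only has c (and its own state)
        ys.append(W_hy @ h)
    return ys

xs = [rng.normal(size=d_in) for _ in range(T_in)]
c = encode(xs)          # whole sentence squeezed into one 4-dim vector
ys = decode(c, T_out)
print(c.shape, len(ys), ys[0].shape)  # (4,) 2 (5,)
```

Notice that no matter how long `xs` gets, `c` stays a fixed-size vector, exactly like Adam's fixed 2 minutes in the story below.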

Now let's try to analyze how this stuff works with an analogy. We build the intuition first, then we can jump into the mathematics.

Suppose we are playing a game. There are three children, namely Adam, Daniel, and Vesper (I hope you got the James Bond reference! LOL). The game is that Daniel tells a story to Adam, who in turn has to explain the same story to Vesper.

But there is a condition! Adam has a fixed amount of time, let's say T1, allotted for every story.

Now suppose that T1 = 2 minutes.

The first story Daniel tells is about how his weekend was. Adam could easily explain the summary to Vesper in 2 minutes; it was easy for him. Next, Daniel told him the story of how his last month went. Adam somehow still managed.

Now you see where the trouble begins. Suppose Daniel tells a story that is a summary of his last 2 years of life. Could Adam ever do it justice in 2 minutes? Never!

Daniel is the encoder, Adam is our context vector, and Vesper is our decoder. She tries to figure out what Daniel exactly meant from just the summary that the "context vector" friend provided. You can see the problem long "stories" can lead to. This is one of the most basic problems faced by a simple encoder-decoder model. Mathematically speaking, a simple model like the one above cannot remember long-term relations.

More precisely, the gradients are not able to sustain information over such long ranges. The gradients seem to "vanish".
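You can watch this decay happen in a few lines of NumPy. Backpropagation through time repeatedly multiplies the gradient by the recurrent Jacobian; when that matrix shrinks vectors (here, a random matrix rescaled to spectral norm 0.9, just to illustrate the regime, and ignoring the tanh derivative, which only shrinks things further), the gradient norm collapses roughly geometrically with the number of time steps.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# A recurrent weight matrix rescaled to spectral norm 0.9
# (illustrative: any matrix whose Jacobian shrinks vectors will do).
W = rng.normal(size=(d, d))
W *= 0.9 / np.linalg.norm(W, 2)

# Backprop through T steps multiplies the gradient by W^T at every step.
grad = np.ones(d)
norms = []
for t in range(50):
    grad = W.T @ grad
    norms.append(np.linalg.norm(grad))

# After 50 steps the gradient norm has shrunk by orders of magnitude.
print(norms[0], norms[-1])
```

With spectral norm 0.9, the norm after 50 steps is bounded by roughly 0.9^49 ≈ 0.006 of the norm after the first step, which is why information from 50 words back barely influences the update.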

One of the better versions of an encoder-decoder (I'll refer to it as ED from now on, it's a long word, dude) is the "bidirectional model". The core idea is that while translating a sentence we do not necessarily go in one direction. Sometimes the proper translation of a particular part of the sentence may require words that occur later. Have a look:

As you can see, in contrast to what was happening earlier, we "move" in both directions. Let me make a point very clear: when we say "move", or draw any network like the first diagram, there are not multiple RNNs. What you see is the time-axis representation of the forward prop. Even above, there are only 2 RNNs, yes, only 2, but 4 time steps, which is why you see 8 blocks! And the same goes for backpropagation.
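The "2 RNNs, 4 time steps, 8 blocks" picture can be sketched directly: two separate RNNs with their own weights (random here, purely for illustration) read the same sequence in opposite directions, and each position's representation is the concatenation of the two hidden states, so every word sees both its past and its future.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_in, d_h = 4, 6, 3  # 4 time steps, as in the diagram above

# Two separate RNNs (hypothetical weights): one reads left-to-right,
# the other right-to-left. 2 RNNs x 4 time steps = 8 "blocks".
Wf_x = rng.normal(size=(d_h, d_in)) * 0.1
Wf_h = rng.normal(size=(d_h, d_h)) * 0.1
Wb_x = rng.normal(size=(d_h, d_in)) * 0.1
Wb_h = rng.normal(size=(d_h, d_h)) * 0.1

def run(xs, W_x, W_h):
    """One vanilla RNN pass; returns the hidden state at every step."""
    h, hs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return hs

xs = [rng.normal(size=d_in) for _ in range(T)]
fwd = run(xs, Wf_x, Wf_h)               # left-to-right pass
bwd = run(xs[::-1], Wb_x, Wb_h)[::-1]   # right-to-left pass, re-aligned

# Each position's representation sees both past and future context.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(states), states[0].shape)  # 4 (6,)
```

Note the `[::-1]` on the backward pass: the second RNN consumes the sequence reversed, and its outputs are flipped back so that position t pairs the forward state (everything up to t) with the backward state (everything from t onward).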

So this approach can make the results a little better.

But as this guy said:

I'm sure you have heard about LSTMs and GRUs (yes, yes, fancy words).

What they help with is sustaining important information over longer ranges (hence the name Long Short-Term Memory units), but this post is not about LSTMs (nor about the gates of the LSTM network). The mathematics of the LSTM network seems a bit overwhelming to some, not because there is some wizardly mathematics going on, but rather because of how it imitates human memory. Let's get the intuition of an LSTM.