
IMPLEMENTING ATTENTION MECHANISM FROM SCRATCH

How the attention layer is an upgrade over single-context-vector seq-to-seq models

I mean, who doesn't crave a little attention? It only helps so much. Machine translation and neural nets have a long history, but the "attention mechanism" is not very old. Applying "attention", an analogy to how humans read and perceive text and other sequential information, has helped achieve better results in the field of machine translation.

Here we try to understand the implementation of an attention layer, which, if used as an imported package/library, is a one-liner, but if implemented from scratch requires some work.

We already saw how, in simple encoder-decoder models, a single context vector is what carries the "summarised info" of the entire input sequence.

Following is the intuition of the attention mechanism, without any math:

When humans try to translate or summarise a sentence, they tend to take in the entire sentence and keep track of how all the words, any sarcasm, any references relate to each other. So, basically, how the words together are responsible for the final translation.

I'll assume that the reader is already familiar with word embeddings, tokenisation, dense layers and the embedding matrix.

So our problem statement is French to English translation.

We perform tokenisation, create the embedding matrix, and now we want to add an attention layer in between the input and output sequences. I hope the intuition is clear.

Instead of passing one context vector, we want our model to see all the states and decide which features are more important during translation at each stage in the decoder.

It's time to visualise the above statement. Have a look:

Image ref: https://sknadig.dev/basics-attention/

Let's break it down. What you see is the LSTMs (a bidirectional encoder) creating their respective states, and instead of passing only the last state we pass weighted states, weighted by factors alpha (learnable parameters), different for each state. This weighted context vector plus the present decoder state together decide the next state and the output.

The above paragraph completely summarises the intuition. Now let's see some math. The final context vector is the sum of the weighted hidden states, with the condition that the alphas are normalised to sum to one. Now, how do we decide the alphas?

Well, it turns out, "what better than a neural network to decide an approximate function", hence:
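In symbols (my reconstruction of the standard Bahdanau formulation, since the original formula appears as an image), with encoder hidden states h_s and the current decoder state d_t:

e_{ts} = v^\top \tanh(W_1 d_t + W_2 h_s)
\alpha_{ts} = \frac{\exp(e_{ts})}{\sum_{s'} \exp(e_{ts'})}
c_t = \sum_{s} \alpha_{ts} \, h_s

The W_1, W_2 and v here correspond to the W1, W2 and V dense layers in the code further below.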

This compatibility (score) function, which is a trainable network, comes in different types.

Here we discuss the Bahdanau attention mechanism (Luong being the other well-known one). Let's start with a little code before going into further details. So the first thing we need is an encoder.

ENCODER

import tensorflow as tf

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,   # one hidden state per time step
                                   return_state=True,       # plus the final state
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)                     # (batch, time) -> (batch, time, embedding_dim)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

What we did above was just define an Encoder class and create one object from that class (you can use a GRU or an LSTM). The input vocab size is the vocab size obtained after tokenising your data. The code is adapted from the TensorFlow neural machine translation tutorial.
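A quick shape check of the encoder (a sketch of mine; the hyperparameter values are illustrative stand-ins, not numbers from the post):

# illustrative hyperparameters, normally obtained from your tokenised dataset
BATCH_SIZE, max_length = 64, 16
vocab_inp_size, embedding_dim, units = 5000, 256, 1024
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

example_input_batch = tf.random.uniform((BATCH_SIZE, max_length),
                                        maxval=vocab_inp_size, dtype=tf.int32)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print(sample_output.shape)  # (64, 16, 1024) -> every time step's hidden state
print(sample_hidden.shape)  # (64, 1024)     -> the final hidden state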

Now, in between the encoder and the decoder, we introduce an attention layer. Just like above, we code a class and create an object.

This is basically the compatibility function, and the "number of attention units" decides the number of units in the dense layers (depending on the scoring function used) that you will initialise for training. Hence it's a hyperparameter.

BAHDANAU ATTENTION MECHANISM

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
attention_layer = BahdanauAttention(10)

The "score" function used in the call method here is "concat". In total there are three varieties of scoring functions that are commonly used:

  1. concat
  2. dot
  3. general

Depending on the scoring function, we initialise our parameters in the attention class. For the different scoring functions, refer to this.
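For comparison, here is a rough sketch of what the dot and general scores could look like in the same (batch, time, units) layout. This is my own illustration, not code from the referenced tutorial, and it assumes the encoder and decoder hidden sizes match:

class LuongDotAttention(tf.keras.layers.Layer):
  # dot score: score(d_t, h_s) = d_t . h_s  (no trainable parameters at all)
  def call(self, query, values):
    score = tf.matmul(values, tf.expand_dims(query, 2))   # (batch, max_len, 1)
    attention_weights = tf.nn.softmax(score, axis=1)
    context_vector = tf.reduce_sum(attention_weights * values, axis=1)
    return context_vector, attention_weights

class LuongGeneralAttention(tf.keras.layers.Layer):
  # general score: score(d_t, h_s) = d_t . (W h_s), one trainable weight matrix
  def __init__(self, units):
    super(LuongGeneralAttention, self).__init__()
    self.W = tf.keras.layers.Dense(units)   # units must equal the decoder hidden size

  def call(self, query, values):
    score = tf.matmul(self.W(values), tf.expand_dims(query, 2))   # (batch, max_len, 1)
    attention_weights = tf.nn.softmax(score, axis=1)
    context_vector = tf.reduce_sum(attention_weights * values, axis=1)
    return context_vector, attention_weights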

Now we will define our Decoder class. Notice how we use the attention object within the decoder class: the attention layer takes the encoder states as input, performs the "attention mechanism" operation, and then we do the "decoding" part. It returns the attention weights and the output state.

DECODER

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

where vocab_tar_size is the vocab size after tokenisation of the target language.
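Continuing the illustrative shapes from the encoder sketch above, a single decoding step would look like this (vocab_tar_size = 6000 and the start-token id of 1 are assumptions for illustration):

vocab_tar_size = 6000
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

dec_input = tf.fill((BATCH_SIZE, 1), 1)   # a batch of <start> token ids
predictions, dec_hidden, attn_weights = decoder(dec_input, sample_hidden, sample_output)
print(predictions.shape)   # (64, 6000)  -> a logit for every target-vocab word
print(attn_weights.shape)  # (64, 16, 1) -> one weight per source time step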

So the final picture is somewhat like this (dot-based attention):

  1. The e's are supplying the normalised alphas, the alphas are performing the weighting operation, and together with the present decoder state they give us the final results.

Here I have discussed the visual representation of the encoder ——> attention ——> decoder part and its mathematics.

Further, what we saw above is global attention. Another approach is local attention, where instead of looking at the entire sentence we might only be interested in a window of words. That, of course, adds another hyperparameter 😛.

Further steps are defining the optimiser and loss function and using a method called "teacher forcing" to train the model. For further reading refer to: https://www.tensorflow.org/tutorials/text/nmt_with_attention
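A hedged sketch of one such training step with teacher forcing, loosely modelled on the tutorial above; the Adam optimiser, the padding id of 0 and the start_token_id argument are my assumptions rather than details from this post:

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  # ignore padding positions (assumed to be token id 0) when averaging the loss
  mask = tf.cast(tf.math.not_equal(real, 0), dtype=pred.dtype)
  return tf.reduce_mean(loss_object(real, pred) * mask)

def train_step(inp, targ, enc_hidden, start_token_id):
  loss = 0.0
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)
    dec_hidden = enc_hidden
    dec_input = tf.fill((inp.shape[0], 1), start_token_id)
    # teacher forcing: feed the *true* previous target word, not the model's own prediction
    for t in range(1, targ.shape[1]):
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
      loss += loss_function(targ[:, t], predictions)
      dec_input = tf.expand_dims(targ[:, t], 1)
  variables = encoder.trainable_variables + decoder.trainable_variables
  optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
  return loss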


PROBLEMS IN ENCODER-DECODER MODELS

If you know the basic architecture of an encoder-decoder model, you will recognise the picture below:

where "C" is the context vector.

In short, this is how it works:

The encoder compresses all the information of the input sentence into one vector (C), which is passed to the decoder, which uses it to decode the output. Pretty simple!

Now let's try to analyse, with an analogy, how this stuff works. We build the intuition first, then we can jump into the mathematics.

Suppose we are playing a game. There are three children, namely Adam, Daniel and Vesper (I hope you got the James Bond reference! lol). The game is that Daniel tells a story to Adam, who in turn has to explain the same story to Vesper.

But there is a condition! Adam has a fixed amount of time, say T1, allotted for every story.

Now suppose that T1 = 2 minutes.

The first story Daniel tells is about how his weekend was. Adam could easily explain the summary to Vesper in 2 minutes; it was easy for him. Next, Daniel told him the story of how his last month went. Adam somehow still managed.

Now you see where the trouble begins. Suppose Daniel tells a story that summarises the last 2 years of his life. Could Adam ever do it justice in 2 minutes? Never!

Daniel is the encoder, Adam is our context vector, and Vesper is our decoder. She tries to figure out what Daniel exactly meant from just the summary that the "context vector" friend provided. You can see the problem long "stories" can lead to. This is one of the most basic problems faced by a simple encoder-decoder model. Mathematically speaking, a simple model like the one above cannot remember long-term relations.

More precisely, the gradients are not able to sustain information over such long ranges; the gradients seem to "vanish".

One of the better versions of an encoder-decoder (I'll refer to it as ED from now on, it's a long word, dude) is the "bidirectional model". The core idea is that when translating a sentence we do not necessarily go in one direction; sometimes the proper translation of a particular part of the sentence may require words that occur later. Have a look:

As you can see, in contrast to what was happening earlier, we "move" in both directions. Let me make one point very clear: when we say "move", or draw a network like the first diagram, there are not multiple RNNs. What you see is the time-axis (unrolled) representation of the forward propagation. Even above, there are only 2 RNNs, yes, only 2, but 4 time steps, which is why you see 8 blocks! And the same goes for backpropagation.
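As a quick illustration (my own sketch, not code from the post), a bidirectional encoder in Keras is just the same recurrent layer wrapped in tf.keras.layers.Bidirectional; the forward and backward passes reuse the time axis rather than adding more RNNs:

import tensorflow as tf

# 32 units each way; the forward and backward features are concatenated,
# so the output feature dimension doubles to 64
bidirectional_gru = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(32, return_sequences=True))
outputs = bidirectional_gru(tf.random.uniform((8, 4, 16)))  # (batch=8, time=4, features=16)
print(outputs.shape)                                        # (8, 4, 64)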

So this approach can make the results a little better.

But as this guy said:

I'm sure you have heard about LSTMs and GRUs (yes, yes, fancy words).

What they help with is sustaining important information over longer ranges (hence the name Long Short-Term Memory units), but this post is not about LSTMs (nor about the gates of the LSTM network). The mathematics of the LSTM network seems a bit overwhelming to some, not because it is some wizardly mathematics, but rather because of how it imitates human memory. Let's get the intuition of an LSTM.

Let's start with the forget gate and the cell state.

Up for a story? Suppose you have a friend who talks way too much, just wayyy too much. He comes to your home while you are on your laptop, and he starts to speak. He has been speaking for the last half an hour and you didn't care. Suddenly you hear your crush's name pop up (lol). Now that's something important, right? So your mind takes it as an input, and now every time you hear "she did" or "she said" you try to connect the dots and pay attention to those particular points. Everything else (whatever bla bla he was saying) is lost: that is the forget gate (mathematically, a vector which tells you how much of every feature is worth remembering). This one example makes it easy to explain all the gates in an LSTM.

"When to forget, when to remember, how to process the next piece of info (the "cell state"), and what you have made out of it till now (the output)."

Now go have a look at the math. You learnt which information was relevant to your crush from "data", right? That is exactly what these little networks do. I'll make a separate post for the detailed mathematics.

Next we will consider transformers.

Till then, have a martini, shaken not stirred.


NESTEROV ACCELERATED GRADIENT DESCENT

How Nesterov accelerated gradient descent increases the efficiency of momentum-based descent

The problem with simple momentum-based gradient descent is that, because of its "momentum", it overshoots the point of the desired minima and hence leads to lots of oscillations before reaching the desired point.

Below you can see a contour map of an error surface (b is the bias axis, w is the weight axis). The red portion is the area of low gradients and the blue portion contains the desired minima. The red line shows the current point over the iterations:

The iteration starts from the red portion on the top right. Using momentum we reach the minima region fast, but we overshoot due to momentum, hence a lot of oscillations before reaching the desired minima.

NESTEROV ACCELERATED GRADIENT DESCENT

The solution to the momentum problem near minima regions is the Nesterov accelerated weight-update rule. It is based on the philosophy of "look before you leap". This is how we try to handle the problem:

  1. From momentum descent we know that at any instant the update depends on (the accumulated past) + (the gradient at that point).
  2. So now what we do is: first move by only the accumulated-past part of the update, and then find the gradient at that look-ahead point. If the gradient there has changed sign, it means you have crossed the point of minima (zero gradient). This helps to avoid overshooting.

Let's look at the algorithm to make things clear:

update rule for NESTEROV ACCELERATED GRADIENT DESCENT:
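(The original formula is shown as an image; the standard form of the rule, with learning rate \eta, momentum factor \gamma, and u_{t-1} the accumulated past update, is reproduced below.)

w_{\text{lookahead}} = w_t - \gamma \, u_{t-1}
u_t = \gamma \, u_{t-1} + \eta \, \nabla_w \mathcal{L}(w_{\text{lookahead}})
w_{t+1} = w_t - u_t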

As we can see, we first make the momentum-only move and get the look-ahead weight (w_lookahead). Then, from that point, we calculate the gradient. If this gradient turns out to have a different sign, it means we have overshot the point of minima. The plot below makes this clear.

Compared to momentum descent, here we first calculate the look-ahead weight (step 4a in the plot); since the sign of the gradient changes (from negative to positive), we avoid taking that step.

Let's look at how this looks in code, and also compare the Nesterov gradient descent plot with the one we get from simple momentum descent. Below, on the (bias vs. weights) plot of the error surface, you can see the Nesterov descent in blue. You can see how drastically the oscillations near the minima region are reduced.
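The code in the original post is an image, so here is a minimal sketch of mine of the same idea on a toy single-sigmoid-neuron error surface; the data points, starting point, eta and gamma values are all illustrative assumptions:

import numpy as np

def error(w, b, X, Y):
    # squared error of a single sigmoid neuron; just a toy surface to descend on
    y_hat = 1.0 / (1.0 + np.exp(-(w * X + b)))
    return 0.5 * np.sum((y_hat - Y) ** 2)

def grad(w, b, X, Y):
    # gradient of the above error w.r.t. w and b
    y_hat = 1.0 / (1.0 + np.exp(-(w * X + b)))
    common = (y_hat - Y) * y_hat * (1.0 - y_hat)
    return np.sum(common * X), np.sum(common)

def descent(X, Y, epochs=200, eta=1.0, gamma=0.9, nesterov=True):
    w, b = -2.0, -2.0     # illustrative starting point in the flat region
    uw, ub = 0.0, 0.0     # accumulated past updates ("momentum") for w and b
    history = []
    for _ in range(epochs):
        if nesterov:
            # look before you leap: measure the gradient at the look-ahead point
            gw, gb = grad(w - gamma * uw, b - gamma * ub, X, Y)
        else:
            # plain momentum: measure the gradient where we currently stand
            gw, gb = grad(w, b, X, Y)
        uw, ub = gamma * uw + eta * gw, gamma * ub + eta * gb
        w, b = w - uw, b - ub
        history.append((w, b, error(w, b, X, Y)))
    return history

# the two trajectories you would plot on the contour map
X, Y = np.array([0.5, 2.5]), np.array([0.2, 0.9])
momentum_path = descent(X, Y, nesterov=False)
nesterov_path = descent(X, Y, nesterov=True)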

Happy traversing!!!