
IMPLEMENTING ATTENTION MECHANISM FROM SCRATCH

How an attention layer improves on the single context vector of seq-to-seq models

I mean, who doesn't crave a little attention? It only helps so much. Machine translation and neural nets have a long history, but the "attention mechanism" is not very old. Applying "attention", an analogy to how humans read and perceive text and other sequential information, has helped achieve better results in the field of machine translation.

Here we try to understand the implementation of an attention layer, which is a one-liner if used as an imported package/library, but requires some work if implemented from scratch.

We already saw how, in simple encoder-decoder models, a single context vector carries the "summarised info" of the entire input sequence.

The following is the intuition behind the attention mechanism, without any math:

When humans try to translate or summarize a sentence, they tend to take in the entire sentence and keep track of how all the words, any sarcasm, and any references relate to each other. So, basically, how the words together are responsible for the final translation.

I'll assume that the reader is already familiar with word embeddings, tokenisation, dense layers and embedding matrices.

So our problem statement is French-to-English translation.

We perform tokenisation, create the embedding matrix, and now we want to add an attention layer between the input and output sequences. I hope the intuition is clear.

Instead of passing one context vector, we want our model to see all the encoder states and decide which features are more important at each stage of decoding.

It's time to visualize the above statement. Have a look:

Image ref: https://sknadig.dev/basics-attention/

Let's break it down. What you see is the LSTMs (a bidirectional encoder) producing their respective states. Instead of passing only the last state, we pass weighted states, weighted by factors alpha (learnable parameters), different for each state. This weighted context vector, together with the present decoder state, decides the next state and output.

The above paragraph completely summarises the intuition. Now let's see some math. The final context vector is the sum of the weighted hidden states, with the condition that the alphas are normalised to sum to one. Now, how do we decide the alphas?
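In symbols (writing h_j for the j-th encoder hidden state and e_ij for the score assigned to it at decoder step i, as in Bahdanau et al.), the weighted sum and the normalisation condition are:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j,
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},
\qquad
\sum_{j} \alpha_{ij} = 1
```

The softmax over the scores e_ij is what guarantees the alphas are positive and sum to one.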

Well, it turns out, "what better than a neural network to approximate a function?" Hence we use a compatibility function: a small trainable network that scores how well each encoder state matches the current decoder state. This compatibility function comes in different types.

Here we discuss the Bahdanau attention mechanism (Luong being the other). Let's start with a little code before going into further details. The first thing we need is an encoder.

ENCODER

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

What we did above was just define an Encoder class and create one object from it (you can use a GRU or an LSTM). vocab_inp_size is the vocabulary size obtained after tokenising your data. The code is referenced from the TensorFlow neural machine translation tutorial.

Now, between the encoder and the decoder, we introduce an attention layer. Just like above, we code a class and create an object.

This is basically the compatibility function, and the "number of attention units" decides the number of units in the dense layers (depending on the scoring function used) that you will initialize for training. Hence it is a hyperparameter.

BAHDANAU ATTENTION MECHANISM

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

attention_layer = BahdanauAttention(10)
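To sanity-check the shapes, here is a NumPy mirror of the call() above; it is only a sketch, with random matrices standing in for the trained Dense layers W1, W2 and V:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, hidden, units = 2, 5, 8, 10

query = rng.normal(size=(batch, hidden))            # decoder hidden state
values = rng.normal(size=(batch, max_len, hidden))  # encoder outputs
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
V = rng.normal(size=(units, 1))

q = query[:, None, :]                               # (batch, 1, hidden): the expand_dims step
score = np.tanh(q @ W1 + values @ W2) @ V           # (batch, max_len, 1)

# softmax over the time axis, so the alphas for each example sum to 1
e = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = e / e.sum(axis=1, keepdims=True)

# weight each encoder state by its alpha, then sum over time
context_vector = (attention_weights * values).sum(axis=1)  # (batch, hidden)
```

Note how broadcasting does the work: the (batch, 1, units) query term is added to every time step of the (batch, max_len, units) values term, exactly what tf.expand_dims enables in the Keras version.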

The "score" function used in the call method here is "concat". In total, there are three varieties of scoring functions in use:

  1. CONCAT
  2. DOT
  3. GENERAL

Depending on the scoring function, we initialize the corresponding parameters in the attention class. For the different scoring functions, refer to Luong et al. (2015).
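The concat score is the one implemented in the BahdanauAttention class above. The other two can be sketched in NumPy as follows (W_a is my own name for the single learned matrix of the "general" score; random values stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, max_len, hidden = 2, 5, 8
query = rng.normal(size=(batch, hidden))            # decoder state s
values = rng.normal(size=(batch, max_len, hidden))  # encoder states h_j

# dot: score_j = s . h_j  (no learned parameters; needs matching dimensions)
dot_score = np.einsum('bh,bth->bt', query, values)  # (batch, max_len)

# general: score_j = s . (W_a h_j), with W_a a learned (hidden, hidden) matrix
W_a = rng.normal(size=(hidden, hidden))
general_score = np.einsum('bh,bth->bt', query, values @ W_a.T)
```

Either score would then go through the same softmax-and-weighted-sum steps as in the concat version.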

Now we will define our Decoder class. Notice how we use the attention object within the Decoder class: the attention layer takes the encoder states as input and performs the attention operation, and then we do the "decoding" part. The decoder returns the output, the state and the attention weights.

DECODER

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

where vocab_tar_size is the vocabulary size after tokenisation of the target language.

So the final picture looks somewhat like this (dot-based attention):

The e's supply the normalised alphas, the alphas perform the weighting operation, and together with the present decoder state they give us the final result.

Here I have discussed the visual representation of the

ENCODER ——> ATTENTION ——> DECODER part and its mathematics.

Further, what we saw above is global attention. Another approach is local attention, where instead of looking at the entire sentence we might only be interested in a window of words. Of course, that adds another hyperparameter 😛.
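A minimal sketch of the difference, assuming my own toy shapes: in local attention the decoder scores only a window of width 2*D + 1 around an aligned source position p, instead of all max_len encoder states (D is the extra hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, units = 2, 10, 4
values = rng.normal(size=(batch, max_len, units))  # encoder states

D, p = 2, 5                               # window half-width and aligned centre position
lo, hi = max(0, p - D), min(max_len, p + D + 1)
local_values = values[:, lo:hi, :]        # only these 2*D + 1 states are scored
```

Global attention is the special case where the window covers the whole sequence.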

Further steps are defining the optimizer and loss function and using a method called "teacher forcing" to train the model. For further reading, refer to: https://www.tensorflow.org/tutorials/text/nmt_with_attention
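The idea of teacher forcing can be sketched in a few lines (toy_decoder_step is a hypothetical stand-in for a real decoder call, not from the tutorial): during training, the input at step t is the ground-truth token target[t-1], not the model's own previous prediction.

```python
import numpy as np

def toy_decoder_step(token, state):
    # hypothetical stand-in for decoder(...): a real decoder would
    # return logits predicting the next token, plus its new state
    vocab_size = 10
    logits = np.zeros(vocab_size)
    return logits, state

def teacher_forced_inputs(target):
    """Return the sequence of tokens fed to the decoder under teacher forcing."""
    state = None
    fed = []
    for t in range(1, len(target)):
        inp = target[t - 1]          # teacher forcing: feed the TRUE previous token
        fed.append(inp)
        logits, state = toy_decoder_step(inp, state)
        # the loss would compare `logits` against target[t]
    return fed

target = [1, 4, 2, 7, 3]             # e.g. <start>, w1, w2, w3, <end> token ids
```

At inference time there is no target, so the decoder's own argmax prediction is fed back in instead.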
