How the attention mechanism is an upgrade over single-context-vector seq2seq models
I mean, who doesn't crave a little attention? It only helps so much. Machine translation and neural nets have a long history, but the "attention mechanism" is not very old. Applying "attention", an analogy to how humans read and perceive text and other sequential information, has helped achieve better results in machine translation.
Here we try to understand the implementation of an attention layer, which, if used as an imported package/library, is a one-liner, but if implemented from scratch requires some work.
We already saw how, in simple encoder-decoder models, a single context vector carries the "summarised info" of the entire input sequence.
Following is the intuition of the attention mechanism, without any math:
Humans, when trying to translate or summarise a sentence, tend to notice the entire sentence and keep track of how all the words, any sarcasm and any references relate to each other; basically, how the words together are responsible for the final translation.
I'll assume that the reader is already familiar with word embeddings, tokenisation, dense layers and the embedding matrix.
So our problem statement is French-to-English translation.
We perform tokenisation, create the embedding matrix, and now we want to add an attention layer in between the input and output sequences. I hope the intuition is clear.
Instead of passing one context vector, we want our model to see all the encoder states and decide which features are more important during translation at each stage in the decoder.
It's time to visualise the above statement. Have a look:

Image ref: https://sknadig.dev/basics-attention/
Let's break it down. What you see is the LSTMs (a bidirectional encoder) creating their respective states, and instead of passing only the last state, we pass a weighted combination of the states, weighted by factors alpha (learnable parameters), different for each state. This weighted context vector plus the present decoder state together decide the next state and the output.
The above paragraph completely summarises the intuition. Now let's see some math. The final context vector is the sum of the weighted hidden states, the other condition being that the alphas are normalised to one. Now, how do we decide the alphas?
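To pin down the notation first (this is the standard Bahdanau-style formulation: h_j are the encoder hidden states and alpha_ij the attention weights for decoder step i):

$$
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j, \qquad \sum_{j=1}^{T_x} \alpha_{ij} = 1
$$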

Well, it turns out, "what better than a neural network to approximate a function?" Hence:
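Concretely (again the standard Bahdanau form: s_{i-1} is the previous decoder state and a(·,·) is a small trainable alignment network):

$$
e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
$$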


This compatibility function (which is a trainable network) comes in different types.
Here we discuss the Bahdanau attention mechanism (Luong being the other). Let's start with a little code before going into further details. The first thing we need is an encoder.
Encoder
import tensorflow as tf

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    # x shape == (batch_size, max_length) -> token ids
    x = self.embedding(x)
    # output: all hidden states (batch_size, max_length, enc_units)
    # state:  the final hidden state (batch_size, enc_units)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
What we did above was just make an encoder class and create one object from that class (you can use a GRU or an LSTM). The input vocab size is the vocab size obtained after tokenising your data. The code is referenced from the TensorFlow neural machine translation tutorial.
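As a quick sanity check of the shapes (a sketch, assuming example hyperparameters BATCH_SIZE = 64, max_length = 16, units = 1024 and a tokenised batch called example_input_batch from your data pipeline):

# hypothetical shape check; BATCH_SIZE, units and example_input_batch are assumed from the data pipeline
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print(sample_output.shape)  # (64, 16, 1024) -> one hidden state per input token
print(sample_hidden.shape)  # (64, 1024)     -> the final encoder state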
Now, in between the encoder and the decoder, we introduce an attention layer. Just like above, we code a class and make an object.
This is basically the compatibility function, and the "number of attention units" decides the number of units in the dense layers (depending on the scoring function used) that you will initialise for training. Hence it's a hyperparameter.
Bahdanau Attention Mechanism
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
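A quick check of what the layer returns (a sketch, reusing the hypothetical sample_hidden and sample_output from the encoder check above):

# reuses sample_hidden / sample_output from the assumed encoder check above
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print(attention_result.shape)   # (64, 1024)  -> one context vector per batch element
print(attention_weights.shape)  # (64, 16, 1) -> one weight per input time step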
The "score" function used in the call method here is "concat". In total there are three varieties of scoring functions that are commonly used:
- concat
- dot
- general

Depending on the scoring function, we initialise different parameters in the attention class (a sketch of the three variants follows below). For the exact formulas of the different scoring functions, refer to this.
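For illustration, here is a minimal sketch of the three scoring variants, written as standalone functions of a decoder query and the encoder values (my own illustration, not from the tutorial; the weight layers W, W1, W2 and V are hypothetical Dense layers you would create in __init__):

# hypothetical helpers, not part of the tutorial
# query:  (batch_size, hidden_size)             -> current decoder state
# values: (batch_size, max_length, hidden_size) -> all encoder states
def dot_score(query, values):
    # plain dot product between the query and every encoder state -> (batch_size, max_length, 1)
    return tf.matmul(values, tf.expand_dims(query, 2))

def general_score(query, values, W):
    # W is a Dense(hidden_size) layer; score = h_enc . (W q)
    return tf.matmul(values, tf.expand_dims(W(query), 2))

def concat_score(query, values, W1, W2, V):
    # the additive ("concat") form used in BahdanauAttention above
    return V(tf.nn.tanh(W1(tf.expand_dims(query, 1)) + W2(values)))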
Now we will define our decoder class. Notice how we use the attention object within the decoder class: the attention layer takes the encoder states as input and performs the "attention mechanism" operation, and then we do the "decoding" part. The decoder returns the predicted output, the new decoder state and the attention weights.
Decoder
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
where vocab_tar_size is the vocab size obtained after tokenisation of the target language.
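A quick shape check for a single decoder step (a sketch; a dummy token batch and the hypothetical sample_hidden and sample_output from above stand in for real data):

# dummy token batch; sample_hidden / sample_output are the assumed tensors from the earlier checks
sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)
print(sample_decoder_output.shape)  # (64, vocab_tar_size) -> logits over the target vocabulary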
So the final picture is somewhat like this (dot-based attention):
- The e's supply the normalised alphas, the alphas perform the weighting operation, and together with the present decoder state they give us the final result.

Here I have discussed the visual representation of the encoder → attention → decoder pipeline and its mathematics.
Further, what we see above is global attention. Another approach is local attention, where instead of looking at the entire sentence we might be interested in a window of words; but then again, that would add another hyperparameter 😛.
Further steps are defining the optimizer and loss function and using a method called "teacher forcing" to train the model. For further reading, refer to: https://www.tensorflow.org/tutorials/text/nmt_with_attention
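To give an idea of what that looks like, here is a condensed sketch of a teacher-forced train step (my own simplification of the linked tutorial; targ_lang and the padded input/target batches are assumed to come from your tokenisation step):

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # mask out padding tokens (id 0) so they don't contribute to the loss
    mask = tf.cast(tf.math.not_equal(real, 0), dtype=tf.float32)
    return tf.reduce_mean(loss_object(real, pred) * mask)

@tf.function
def train_step(inp, targ, enc_hidden):
    # inp/targ: padded token-id batches; targ_lang is the target-language tokenizer (assumed)
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        # every target sequence starts with the <start> token
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # teacher forcing: feed the ground-truth target token as the next decoder input
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(targ.shape[1])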