
The Fault in "GPT-3": Is It Hype?

IS GPT-3 HYPE? WHAT THE MEDIA SAYS, AND IS IT IMPORTANT FOR INTERVIEWS?

If you are familiar with the concepts of attention and transformers, you must have come across the word "GPT", be it through the earlier models like GPT-1 and GPT-2 or the recently released GPT-3.

GPTs (Generative Pre-trained Transformers) are decoder-only transformer stacks developed by OpenAI.

Ever since GPT-3 was released, platforms like Twitter have been flooded with posts glorifying the model and what it can do. The posts were written in a manner that would make any layperson perceive it as some sort of magic. Funny claims like "this is the end of software engineering" were made.

GPT-3 is in fact a milestone in NLP, as it showed performance like never before. But one needs to understand the limitations and the reasons for such performance. In the end, one can see that GPT-3 is far from deserving the label "near human intelligence".

CHANGES FROM GPT-1 AND GPT-2

Below you can see the architecture of the GPT-1 model (one transformer decoder).

Further enhancements, by varying the number of layers and parameters, led to GPT-2.

ref: https://jalammar.github.io/illustrated-gpt2/

GPT-3 is structurally similar to GPT-2. The main advancements are the result of the extremely large number of parameters used in training the model. Also, the computing resources used were far more than any "normal" research group can afford.

GPT-2 VS GPT-3 PARAMETER COMPARISON

MODEL                  | NUMBER OF PARAMETERS | NUMBER OF LAYERS | BATCH SIZE
GPT-2                  | 1.5 B                | 48               | 512
GPT-3 Small            | 125 M                | 12               | 0.5 M
GPT-3 Medium           | 350 M                | 24               | 0.5 M
GPT-3 Large            | 760 M                | 24               | 0.5 M
GPT-3 6.7B             | 6.7 B                | 32               | 2 M
GPT-3 13B              | 13.0 B               | 40               | 2 M
GPT-3 175B or "GPT-3"  | 175.0 B              | 96               | 3.2 M
All GPT-3 models were trained for a total of 300 billion tokens. Dataset: Common Crawl.

THE MAJORITY OF THE PERFORMANCE GAINS COME FROM THE ENORMOUS NUMBER OF PARAMETERS.

IS IT SCALABLE?

Well, if you are thinking of training a GPT-3 model from scratch, you might need to think twice. Even for OpenAI, the cost of training GPT-3 was close to $4.6 million. And at present computing costs, training a hypothetical GPT-4 or GPT-8 might be too expensive even for such huge organizations.

THE NEGATIVE BIAS

Given that GPT-3 was trained on Common Crawl data from the internet, the model is prone to "learning" social bias against women and black people, along with the hate speech that is present in abundance on the internet. It is not surprising these days to find two people cussing and fighting on any social media platform, sadly.

ALWAYS RIGHT?

GPT-3 fails at tasks that are very problem-specific. You can expect it to understand and answer common daily-life questions (even then there is no guarantee of one hundred percent accuracy), but it cannot answer very specific medical case questions. Also, there is no "fact-checking mechanism" to ensure that the output is not only semantically plausible but also factually correct.

GPT FOR VISION?

A direct implementation of transformers isn't feasible considering the dimensionality of an image and the training time complexity of a transformer. Even for people/organizations with huge computational power it is overwhelming.

A RECENTLY PUBLISHED PAPER, "AN IMAGE IS WORTH 16X16 WORDS", HAS SHOWN HOW TO USE TRANSFORMERS FOR CV TASKS. DO CHECK OUT THIS LINK:

ACCESS RIGHTS

At the moment, not everyone can get access to it. OpenAI wants to ensure that no one misuses it. This has certainly raised some questions in the AI community and is debatable.

WHERE CAN YOU SEE A DEMO? CHECK THIS OUT!

MEDIA HYPE?

YES!!! Any model till now is miles away from achieving general intelligence. Even the GPT-3 research team has clearly asked the media not to create a "FAKE BUZZ": even though this is a milestone for sure, it is not general intelligence and can make errors.

FOR INTERVIEWS?

Given the access rights, the fact that you cannot train it, and that even if you could it would just be a library implementation like BERT, you are only expected to know the theoretical part if you mention it in your resume.

😛

LINK TO GPT-3 RESEARCH PAPER : https://arxiv.org/pdf/2005.14165.pdf


ML INTERVIEW QUESTIONS - PART 1

IN THIS INTERVIEW PREP SERIES WE LOOK AT IMPORTANT INTERVIEW QUESTIONS ASKED FOR DATA SCIENTIST, ML AND DL ROLES.

IN EACH PART WE WILL DISCUSS A FEW ML INTERVIEW QUESTIONS.

1) What is a PDF?

A probability density function (PDF) is a statistical expression that defines a probability distribution (the likelihood of an outcome) for a continuous random variable. The PDF over an interval indicates the probability of the random variable falling within that interval.
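A minimal sketch (assuming a standard normal variable and that scipy is installed): the probability of landing in an interval is the integral of the PDF over that interval, which matches the difference of CDF values.

# Probability of a standard normal variable falling in [-1, 1]:
# integrate the PDF over the interval, or take the CDF difference.
from scipy.stats import norm
from scipy.integrate import quad

pdf_integral, _ = quad(norm.pdf, -1, 1)        # numerically integrate the PDF
cdf_difference = norm.cdf(1) - norm.cdf(-1)    # same quantity via the CDF

print(pdf_integral, cdf_difference)            # both ~0.6827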

2) What is Confidence Interval?

A confidence interval displays the probability that a parameter will fall between a pair of values around the mean. Confidence intervals measure the degree of uncertainty or certainty in a sampling method.
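A minimal sketch, assuming roughly normal sample data, of a 95% confidence interval for a sample mean using scipy's t-distribution helpers:

# 95% confidence interval for a sample mean (assumes numpy/scipy).
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(loc=50, scale=5, size=100)

mean = sample.mean()
sem = stats.sem(sample)                       # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")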

3) Can KL divergence be used as a distance measure?

No. It is not a distance metric, as it is not symmetric: KL(P||Q) and KL(Q||P) are different quantities in general.
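A quick numerical check of the asymmetry (scipy's entropy computes the KL divergence when given two distributions):

# KL divergence is not symmetric, so it is not a distance metric.
import numpy as np
from scipy.stats import entropy

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(entropy(p, q))   # KL(P || Q)
print(entropy(q, p))   # KL(Q || P) -- a different value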

4) What is Log-normal distribution? 

In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.
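A tiny numpy sketch of this relationship (the mean and sigma arguments of numpy's lognormal sampler refer to the underlying normal):

# If X is log-normal, then Y = ln(X) is normal.
import numpy as np

np.random.seed(0)
x = np.random.lognormal(mean=1.0, sigma=0.5, size=100_000)  # X ~ LogNormal
y = np.log(x)                                               # Y = ln(X) ~ Normal

print(y.mean(), y.std())   # close to 1.0 and 0.5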

5) What is Spearman Rank Correlation Coefficient?

The Spearman rank correlation coefficient is obtained by applying the Pearson correlation coefficient to the rank-encoded random variables.
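A small sketch verifying this with scipy: Spearman's rho equals Pearson's r computed on the ranks.

# Spearman correlation == Pearson correlation applied to ranks.
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 1000])   # the outlier barely affects Spearman
y = np.array([1, 3, 2, 5, 4])

rho_direct = stats.spearmanr(x, y)[0]
rho_via_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
print(rho_direct, rho_via_ranks)        # identical values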

6) Why is ā€œNaiveā€ Bayes naive?

Naive Bayes assumes conditional independence of the features given the class, which rarely holds in practice. The assumption is made to simplify the computation of the conditional probabilities. Naive Bayes is "naive" because of this assumption.

7) What is the "crowding problem" in t-SNE?

This happens when datapoints are distributed in a region of a high-dimensional manifold around a point i, and we try to model the pairwise distances from i to those datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints end up too far away in the two-dimensional map.

8)What are the limitations of PCA?

  • PCA works best for variables that are strongly correlated.
  • If the relationship between variables is weak, PCA does not reduce the data well; check the correlation matrix to decide whether PCA is appropriate.
  • PCA results are difficult to interpret clearly, since each component is a linear combination of the original features.
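A small sketch of the first limitation above, using scikit-learn's explained variance ratio: strongly correlated features compress into one component, uncorrelated ones do not.

# PCA compresses correlated features well; on uncorrelated features
# the variance stays spread across components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
correlated = np.hstack([x, 2 * x + 0.05 * rng.normal(size=(500, 1))])
uncorrelated = rng.normal(size=(500, 2))

print(PCA(n_components=2).fit(correlated).explained_variance_ratio_)    # ~[1.0, 0.0]
print(PCA(n_components=2).fit(uncorrelated).explained_variance_ratio_)  # ~[0.5, 0.5]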

9) Name 2 failure cases of KNN.

When the query point is an outlier, or when the data is extremely random and carries no information.

10) Name the assumptions of linear regression

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

11) Why are log probabilities used in Naive Bayes?

The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
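A tiny demonstration of the underflow problem: the raw product of many small likelihoods collapses to zero, while the sum of logs stays well-behaved.

# Multiplying many small likelihoods underflows to 0.0,
# while summing their logs remains finite.
import numpy as np

likelihoods = np.full(500, 1e-3)          # 500 features, each with likelihood 0.001

product = np.prod(likelihoods)            # underflows to 0.0
log_sum = np.sum(np.log(likelihoods))     # finite: 500 * log(0.001)

print(product, log_sum)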

12) How are numerical features handled in Gaussian NB?

Numerical features are assumed to be Gaussian. The class-conditional likelihoods are obtained by fitting a separate Gaussian (mean and variance) to the data points of each class.
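A minimal scikit-learn sketch: GaussianNB exposes the per-class means and variances it estimates for each numerical feature.

# GaussianNB fits a per-class mean and variance for each feature.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

print(model.theta_)   # per-class feature means, shape (n_classes, n_features)
print(model.var_)     # per-class feature variances (sigma_ in older scikit-learn versions)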

13) How do you get feature importance in Naive Bayes?

The Naive Bayes classifiers don't offer an intrinsic method to evaluate feature importances. Naive Bayes methods work by determining the conditional and unconditional probabilities associated with the features and predicting the class with the highest probability.

14) Differentiate between GD and SGD.

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

In SGD, only one data point is used per iteration to compute the gradient of the loss function, while in GD all the data points are used for every update.
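A minimal sketch of the difference on a one-parameter linear model y = w*x with squared error (illustrative only):

# One GD step uses the gradient over the FULL dataset;
# one SGD step uses the gradient from a SINGLE random point.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w, lr = 0.0, 0.1

# Gradient descent: gradient of mean squared error over all points
grad_full = np.mean(2 * (w * x - y) * x)
w_gd = w - lr * grad_full

# Stochastic gradient descent: gradient from one randomly chosen point
i = rng.integers(len(x))
grad_single = 2 * (w * x[i] - y[i]) * x[i]
w_sgd = w - lr * grad_single

print(w_gd, w_sgd)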

15) Do you know the train and run time complexities of an SVM model?

Train time complexity: O(n^2)

Run time complexity: O(k * d)

where k = number of support vectors and d = dimensionality of the dataset.

16) Why is RBF kernel SVM compared to kNN?

They are not that similar, but they are related. The point is that both kNN and RBF kernels are non-parametric approaches that estimate the probability density of your data.

Notice that these two algorithms approach the same problem differently: kernel methods fix the size of the neighborhood (h) and then calculate K, whereas kNN fixes the number of points, K, and then determines the region in space containing those points.

RBF KERNEL

17) What decides overfitting and underfitting in DT?

The max_depth parameter largely decides overfitting and underfitting in decision trees: a very deep tree overfits, while a very shallow tree underfits.
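A short scikit-learn sketch of that trade-off (dataset choice is just for illustration): an unlimited-depth tree reaches perfect train accuracy but a lower test score, while depth 1 underfits both.

# max_depth controls the bias/variance trade-off of a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):                     # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))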

18) What is Non-negative Matrix Factorization?

It is the decomposition of a matrix into two smaller matrices, with all elements non-negative, whose product approximates the original matrix.
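A minimal scikit-learn sketch: factor a non-negative matrix X into W and H and check that W @ H roughly reconstructs X.

# NMF: X ≈ W @ H with W, H non-negative.
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(0).normal(size=(6, 4)))   # a non-negative matrix

model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W = model.fit_transform(X)     # shape (6, 2)
H = model.components_          # shape (2, 4)

print(np.round(W @ H, 2))      # approximately reconstructs X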

19) What is the Netflix Prize problem?

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films based on previous ratings, without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.

20) What are word embeddings?

Word embeddings are a type of word representation in a vector space that allows words with similar meaning to have a similar representation.


THE WORD2VEC MODEL (CBOW/SKIP-GRAM)

Understanding word embeddings, semantics, and the word2vec training methods: skip-gram and CBOW.

WHY EMBEDDINGS?

Remember, machines can only operate on numbers. Approaches like bag of words and TF-IDF provide a way of converting text to numbers, but not without drawbacks. Here are a few:

  1. As the dimensionality (vocabulary size) increases, it becomes memory inefficient.
  2. Since all the word representations are orthogonal, there is no semantic relation between words. That is, the amount of separation between words like "apple" and "orange" is the same as between "apple" and "door".

Therefore the semantic relation between words, which is really valuable information, is lost. We need techniques which, along with converting text to numbers, keep space efficiency in mind and also preserve the semantic relations among words.

EMBEDDINGS ARE USED IN MANY DEEP LEARNING NLP MODELS, SUCH AS SEQ2SEQ MODELS AND TRANSFORMERS.

WORD2VEC

Before understanding how it does the job we discussed above, let's see what it is actually composed of.

Word2vec is a two-layer neural network that generates word embeddings of a chosen dimension from a given text corpus.

So essentially word2vec helps us create a numerical mapping of words in a vector space; remember, this dimensionality can be much smaller than the vocabulary size.

So what does the word2vec model try to do?

It tries to make similar embeddings for words that occur in similar contexts.

It tries to achieve this using two algorithms:

  1. CBOW (continuous bag of words)
  2. Skip-gram

CBOW and SKIP-GRAM

CBOW: A MODEL THAT TRIES TO PREDICT THE TARGET WORD FROM THE GIVEN CONTEXT WORDS

CBOW ALGORITHM REPRESENTATION

SKIP-GRAM: A MODEL THAT TRIES TO PREDICT THE CONTEXT WORDS FROM THE GIVEN TARGET WORD

Here is how the probability inside the loss function is defined: P(w_{t+j} | w_t) for skip-gram; P(w_t | w_{t+j}) for CBOW:

Let's break it down. Suppose we define the "context" to be a window of m words around a target word. We iterate over all such windows from the beginning of our sentence and try to maximize the conditional probability of a word occurring given a context word, or vice versa.
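Written out explicitly (this is the standard skip-gram objective from the original word2vec formulation, not stated verbatim above; CBOW simply swaps what is conditioned on what), with window size m over a corpus of T words:

J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(w_O \mid w_I) = \frac{\exp(u_{w_O}^{\top} v_{w_I})}{\sum_{w=1}^{V} \exp(u_{w}^{\top} v_{w_I})}

Here v are the input (hidden-layer) word vectors, u are the output-layer word vectors, and V is the vocabulary size.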

THE WORKING

  1. LET'S SAY THAT YOUR VOCABULARY SIZE IS V.
  2. WE PERFORM ONE-HOT ENCODING FOR ALL WORDS.
  3. WE CHOOSE OUR WINDOW SIZE AS C.
  4. WE DECIDE THE EMBEDDING DIMENSION, SAY N. OUR HIDDEN LAYER WILL HAVE N NEURONS.
  5. THE OUTPUT IS A SOFTMAX LAYER OF DIMENSION V.
  6. WITH THESE CHOICES WE CREATE THE ARCHITECTURE BELOW.
CBOW WORD2VEC ARCHITECTURE

WE PERFORM THE TRAINING AND UPDATE THE WEIGHTS. AFTER TRAINING, THE WEIGHT MATRIX BETWEEN THE HIDDEN LAYER AND THE OUTPUT SOFTMAX IS OUR WORD EMBEDDING MATRIX. NOTE THAT ITS DIMENSION IS (OUR SELECTED LOWER DIMENSION) x (VOCAB SIZE). HENCE IT CONTAINS THE N-DIMENSIONAL VECTOR-SPACE REPRESENTATION OF ALL THE WORDS IN THE VOCABULARY.

THE WORKING OF SKIP-GRAM IS JUST THE OPPOSITE: THERE WE USE ONE WORD TO PREDICT THE CONTEXT WORDS.

SKIP-GRAM WORD2VEC ARCHITECTURE

NOW THE FINAL STEP: TO OBTAIN THE WORD VECTOR OF ANY WORD, WE TAKE THE EMBEDDING MATRIX THAT WE TRAINED AND MULTIPLY IT BY THAT WORD'S ONE-HOT ENCODED REPRESENTATION.
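A tiny numpy sketch of that lookup (toy sizes; the matrix here is random rather than trained): multiplying the N x V embedding matrix by a one-hot vector just selects the corresponding column.

# Looking up a word vector = embedding matrix times the word's one-hot vector.
import numpy as np

V, N = 5, 3                                        # vocab size 5, embedding dimension 3
W = np.random.default_rng(0).normal(size=(N, V))   # embedding matrix, shape (N, V)

word_index = 2
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

print(W @ one_hot)        # identical to W[:, word_index]
print(W[:, word_index])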

THAT'S IT!

IT'S ALWAYS HELPFUL TO REMEMBER THE LOSS FUNCTION MENTIONED ABOVE IN INTERVIEWS!


BATCH NORMALISATION IN NEURAL NETWORKS

HOW BATCH NORMALISATION SPEEDS UP TRAINING, HELPS IN SCALING AND ALSO ACTS AS A REGULARIZER IN NEURAL NETWORKS

We know that normalization and feature scaling help in achieving faster training and convergence. But why is that the case? Normalization makes our cost function more symmetric in all variables, so we do not have to worry about a certain learning rate being too small in some directions and too overwhelming in others.

Hence we can use sufficiently large learning rates to converge faster. (THERE ARE ADAPTIVE LEARNING RATE OPTIMIZERS LIKE ADAM, BUT NORMALIZATION STILL HELPS.) Also recall that in neural networks data points are passed in batches.

The idea behind batch normalization is to normalize the activations in every layer with respect to the current batch of data.
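A minimal numpy sketch of that per-batch normalization step (gamma and beta, discussed later, are left out here):

# Batch normalization: normalize each activation using the mean and
# variance of the CURRENT batch.
import numpy as np

batch = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))  # (batch, features)

mean = batch.mean(axis=0)
var = batch.var(axis=0)
normalized = (batch - mean) / np.sqrt(var + 1e-5)

print(normalized.mean(axis=0).round(3))   # ~0 per feature
print(normalized.std(axis=0).round(3))    # ~1 per feature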

WHY BATCH NORMALISATION?

  1. SPEEDS UP TRAINING.
  2. REDUCES THE IMPORTANCE OF WEIGHT INITIALISATION (BECAUSE THE COST FUNCTION IS SMOOTHED OUT).
  3. CAN BE THOUGHT OF AS A REGULARIZER.

HOW CAN IT BE ASSOCIATED WITH REGULARIZATION?

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems.

WIKIPEDIA

NOTICE THAT WHEN NORMALIZING THE ACTIVATIONS, THE MEAN AND SIGMA VARY WITH EACH BATCH THAT PASSES. THIS CREATES SOME "RANDOMNESS", AS EACH NEW UNSEEN BATCH HAS DIFFERENT MEAN AND SIGMA VALUES. THIS CAN BE THOUGHT OF AS REGULARIZATION, AS IT HELPS THE MODEL NOT TO OVERFIT TO A CERTAIN DISTRIBUTION.

REMEMBER THAT BATCH NORMALIZATION IS OFTEN USED TOGETHER WITH DROPOUT IN PRACTICE.

IMPLEMENTATION OF BATCH NORMALIZATION IN KERAS

Since the mean and variance can vary greatly from batch to batch, there needs to be some calibration of these two statistics. Hence we introduce two learnable parameters for this purpose.

THE NUMBER OF LEARNABLE PARAMETERS IN BATCH NORMALIZATION IS 2 (PER FEATURE), NAMELY GAMMA AND BETA: GAMMA SCALES THE NORMALIZED ACTIVATION AND BETA SHIFTS IT.
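A quick way to see this in Keras: building a BatchNormalization layer over 10 features shows gamma and beta as trainable weights, plus a non-trainable moving mean and moving variance that are used at inference time.

# Per feature: trainable gamma (scale) and beta (shift), plus
# non-trainable moving_mean and moving_variance.
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build((None, 10))                      # build the layer for 10 input features

for w in bn.weights:
    print(w.name, tuple(w.shape), w.trainable)
# gamma (10,) True, beta (10,) True, moving_mean (10,) False, moving_variance (10,) False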

THE CITATION IN THE RESEARCH PAPER SHOWN IN THE VERY BEGINNING

AN IMPORTANT INTERVIEW QUESTION REGARDING THE BATCH NORMALISATION LAYER:

How does it work differently during training and testing?

Paraphrasing the answer from the official Keras documentation: during training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs; during inference, it normalizes its output using a moving average of the mean and standard deviation of the batches seen during training.


IMPLEMENTING THE ATTENTION MECHANISM FROM SCRATCH

How an attention layer is an upgrade over single-context-vector seq2seq models.

I MEAN, WHO DOESN'T CRAVE A LITTLE ATTENTION? IT ONLY HELPS SO MUCH. MACHINE TRANSLATION AND NEURAL NETS HAVE A LONG HISTORY, BUT THE "ATTENTION MECHANISM" IS NOT VERY OLD. APPLYING "ATTENTION" (AN ANALOGY TO HOW HUMANS READ AND PERCEIVE TEXT/SEQUENTIAL INFORMATION) HAS HELPED IN ACHIEVING BETTER RESULTS IN THE FIELD OF MACHINE TRANSLATION.

HERE WE TRY TO UNDERSTAND THE IMPLEMENTATION OF AN ATTENTION LAYER, WHICH IS A ONE-LINER IF USED AS AN IMPORTED PACKAGE/LIBRARY, BUT REQUIRES SOME WORK IF IMPLEMENTED FROM SCRATCH.

WE ALREADY SAW HOW, IN SIMPLE ENCODER-DECODER MODELS, A SINGLE CONTEXT VECTOR IS WHAT CARRIES THE "SUMMARISED INFO" OF THE ENTIRE INPUT SEQUENCE.

FOLLOWING IS THE INTUITION OF THE ATTENTION MECHANISM WITHOUT ANY MATH:

HUMANS, WHEN TRYING TO TRANSLATE OR SUMMARIZE ANY SENTENCE, TEND TO READ THE ENTIRE SENTENCE AND KEEP TRACK OF HOW ALL THE WORDS, ANY SARCASM, AND ANY REFERENCES ARE RELATED TO EACH OTHER. SO, BASICALLY, HOW THE WORDS TOGETHER ARE RESPONSIBLE FOR THE FINAL TRANSLATION.

I'LL ASSUME THAT THE READER IS ALREADY FAMILIAR WITH WORD EMBEDDINGS, TOKENISATION, DENSE LAYERS AND THE EMBEDDING MATRIX.

SO OUR PROBLEM STATEMENT IS FRENCH-TO-ENGLISH TRANSLATION.

WE PERFORM TOKENISATION, CREATE THE EMBEDDING MATRIX, AND NOW WE WANT TO ADD AN ATTENTION LAYER BETWEEN THE INPUT AND OUTPUT SEQUENCES. I HOPE THE INTUITION IS CLEAR.

INSTEAD OF PASSING ONE CONTEXT VECTOR, WE WANT OUR MODEL TO SEE ALL THE ENCODER STATES AND DECIDE WHICH FEATURES ARE MORE IMPORTANT AT EACH STAGE OF DECODING.

IT'S TIME TO VISUALIZE THE ABOVE STATEMENT. HAVE A LOOK:

IMAGE REF: https://sknadig.dev/basics-attention/

LET'S BREAK IT DOWN. WHAT YOU SEE IS THE LSTMS (A BIDIRECTIONAL ENCODER) CREATING THEIR RESPECTIVE STATES, AND INSTEAD OF PASSING ONLY THE LAST STATE, WE PASS WEIGHTED STATES, WEIGHTED BY FACTORS ALPHA (LEARNABLE), DIFFERENT FOR EACH STATE. THIS WEIGHTED CONTEXT VECTOR PLUS THE PRESENT DECODER STATE TOGETHER DECIDE THE NEXT STATE AND OUTPUT.

THE ABOVE PARAGRAPH COMPLETELY SUMMARISES THE INTUITION. NOW LET'S SEE SOME MATH. THE FINAL CONTEXT VECTOR IS THE SUM OF THE WEIGHTED HIDDEN STATES, THE OTHER CONDITION BEING THAT THE ALPHAS ARE NORMALISED TO SUM TO ONE. NOW, HOW DO WE DECIDE THE ALPHAS?
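In symbols (standard Bahdanau-style notation, not copied verbatim from this post): with encoder hidden states h_j and previous decoder state s_{t-1},

e_{tj} = \mathrm{score}(s_{t-1}, h_j), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k}\exp(e_{tk})}, \qquad
c_t = \sum_{j} \alpha_{tj}\, h_j

The softmax over the scores e guarantees the alphas are non-negative and sum to one, and c_t is the context vector fed to the decoder at step t.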

WELL, IT TURNS OUT, "WHAT BETTER THAN A NEURAL NETWORK TO LEARN AN APPROXIMATE FUNCTION", HENCE:

THIS COMPATIBILITY FUNCTION (WHICH IS A TRAINABLE NETWORK) COMES IN DIFFERENT TYPES.

HERE WE DISCUSS THE BAHDANAU ATTENTION MECHANISM (LUONG BEING THE OTHER). LET'S START WITH A LITTLE CODE BEFORE GOING INTO FURTHER DETAILS. SO THE FIRST THING WE NEED IS AN ENCODER.

ENCODER

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    # return_sequences=True gives every time step's state (needed for attention),
    # return_state=True additionally gives the final state (used to start the decoder)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)                               # (batch, max_len, embedding_dim)
    output, state = self.gru(x, initial_state=hidden)   # (batch, max_len, units), (batch, units)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

WHAT WE DID ABOVE WAS JUST MAKE AN ENCODER CLASS AND CREATE ONE OBJECT FROM THAT CLASS (YOU CAN USE A GRU OR AN LSTM). THE INPUT VOCAB SIZE IS THE VOCAB SIZE OBTAINED AFTER TOKENISING YOUR DATA. THE CODE IS REFERENCED FROM THE TENSORFLOW NEURAL MACHINE TRANSLATION TUTORIAL.

NOW, BETWEEN THE ENCODER AND DECODER, WE INTRODUCE AN ATTENTION LAYER. JUST LIKE ABOVE, WE CODE A CLASS AND MAKE AN OBJECT.

THIS IS BASICALLY THE COMPATIBILITY FUNCTION, AND THE "NUMBER OF ATTENTION UNITS" DECIDES THE NUMBER OF UNITS IN THE DENSE LAYERS (DEPENDING ON THE SCORING FUNCTION USED) THAT YOU INITIALIZE FOR TRAINING. HENCE IT IS A HYPERPARAMETER.

BAHDANAU ATTENTION MECHANISM

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
attention_layer = BahdanauAttention(10)

THE "SCORE" FUNCTION USED IN THE CALL METHOD HERE IS THE "CONCAT" (ADDITIVE) ONE. IN TOTAL THERE ARE 3 VARIETIES OF SCORING FUNCTIONS THAT ARE COMMONLY USED:

  1. CONCAT
  2. DOT
  3. GENERAL

DEPENDING ON THE SCORING FUNCTION, WE INITIALIZE OUR PARAMETERS IN THE ATTENTION CLASS DIFFERENTLY. FOR THE DIFFERENT SCORING FUNCTIONS, REFER TO THIS.
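A rough sketch of the three score families (illustrative only, not library code): W, W1, W2 and V here are assumed to be tf.keras.layers.Dense layers, and the dot score additionally assumes the encoder and decoder units match.

# Common alignment scores between a decoder query (batch, units)
# and the encoder states `values` (batch, max_len, units).
import tensorflow as tf

def dot_score(query, values):
    # (batch, max_len, units) x (batch, units, 1) -> (batch, max_len, 1)
    return tf.matmul(values, tf.expand_dims(query, 1), transpose_b=True)

def general_score(query, values, W):
    # the query is first passed through a learned weight matrix W (a Dense layer)
    return tf.matmul(values, tf.expand_dims(W(query), 1), transpose_b=True)

def concat_score(query, values, W1, W2, V):
    # additive / "concat" score, exactly what the BahdanauAttention class above uses
    return V(tf.nn.tanh(W1(tf.expand_dims(query, 1)) + W2(values)))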

NOW WE WILL DEFINE OUR DECODER CLASS. NOTICE HOW WE USE THE ATTENTION OBJECT WITHIN THE DECODER CLASS. THE ATTENTION LAYER TAKES INPUT FROM THE ENCODER STATES, PERFORMS THE "ATTENTION MECHANISM" OPERATION, AND THEN WE DO THE "DECODING" PART. IT RETURNS THE ATTENTION WEIGHTS AND THE OUTPUT STATE.

DECODER

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

WHERE vocab_tar_size IS THE VOCAB SIZE AFTER TOKENISATION OF THE TARGET LANGUAGE.

SO THE FINAL PICTURE IS SOMEWHAT LIKE THIS (DOT-BASED ATTENTION):

  1. THE e's are supplying the normalised alphas, the alphas are performing the weighting operation, and together with the present decoder state they give us the final results.

HERE I HAVE DISCUSSED THE VISUAL REPRESENTATION OF THE ENCODER -> ATTENTION -> DECODER PIPELINE AND ITS MATHEMATICS.

FURTHER, WHAT WE SEE ABOVE IS GLOBAL ATTENTION. ANOTHER APPROACH IS LOCAL ATTENTION, WHERE INSTEAD OF LOOKING AT THE ENTIRE SENTENCE WE MIGHT ONLY BE INTERESTED IN A WINDOW OF WORDS. BUT AGAIN, THAT WOULD ADD ANOTHER HYPERPARAMETER 😛.

FURTHER STEPS ARE DEFINING THE OPTIMIZER AND LOSS FUNCTION AND USING A METHOD CALLED "TEACHER FORCING" TO TRAIN THE MODEL. FOR FURTHER READING, REFER TO: https://www.tensorflow.org/tutorials/text/nmt_with_attention
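To round this off, here is a rough sketch of one training step with teacher forcing, loosely following the TensorFlow tutorial linked above. It assumes `targ_lang`, `loss_function`, `optimizer` and `BATCH_SIZE` are defined as in that tutorial, alongside the `encoder` and `decoder` objects built earlier.

# One training step with teacher forcing: at every decoding step we feed
# the TRUE previous target word (not the model's own prediction).
import tensorflow as tf

@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        # start every target sequence with the <start> token
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)   # teacher forcing

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(targ.shape[1])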