IS GPT-3 HYPE? WHAT THE MEDIA SAYS, AND IS IT IMPORTANT FOR INTERVIEWS?
If you are familiar with the concept of attention and transformers, you must have come across the word “GPT”, be it the earlier models like GPT-1 and GPT-2 or the recently released GPT-3.
GPTs (generative pre-trained transformers) are decoder-only stacks developed by OpenAI.
Ever since GPT-3 was released, platforms like Twitter have been flooded with posts that glorify the model and what it can do. The posts were written in a manner that would make any layperson perceive it as some sort of magic. Funny claims like “this is the end of software engineering” were made.
GPT-3 is in fact a milestone in NLP, as it showed performance like never before. But one needs to understand its limitations and the reasons for such performance. In the end one can see that GPT-3 is far from being “near to human intelligence”.
CHANGES FROM GPT-1 AND GPT-2
Below you can see the architecture of the GPT-1 model (built from transformer decoder blocks).
Further enhancements, made by varying the number of layers and parameters, led to GPT-2.
GPT-3 is structurally similar to GPT-2. The main advancements are the result of the extremely large number of parameters in the model. Also, the computing resources used for training were far more than any “normal” research group can afford.
GPT-2 VS GPT-3 PARAMETERS COMPARISON
(Table: number of parameters and number of layers for each model, including GPT-3 6.7B and GPT-3 175B, the latter being the model usually called “GPT-3”.)
All GPT-3 models were trained for a total of 300 billion tokens. DATA SET: MAINLY COMMON CRAWL
THE MAJORITY OF THE PERFORMANCE BENEFITS COME FROM THE ENORMOUS NUMBER OF PARAMETERS.
IS IT SCALABLE?
Well, if you are thinking of training a GPT-3 model from scratch, you might need to think twice. Even for OpenAI, the cost of training GPT-3 was estimated at close to $4.6 million. And at present computing costs, training a GPT-4 or GPT-8 might be too expensive even for such huge organizations.
THE NEGATIVE BIAS
Given that GPT-3 was trained on Common Crawl data from the internet, the model was prone to “learn” the social bias against women and black people, and the hate comments, that are present in abundance on the internet. It’s not surprising these days to find two people cussing and fighting on any social media platform, sad.
GPT-3 fails at tasks which are very problem specific. You can expect it to understand and answer common daily life questions (even then there is no guarantee of a hundred percent accuracy), but it can’t answer very specific medical case questions. Also, there is no “fact checking mechanism” that can ensure that the output is not only semantically correct but also factually correct.
GPT FOR VISION?
Directly applying transformers to raw pixels isn’t feasible, considering the dimensionality of an image and the training time complexity of a transformer (the cost of self-attention grows quadratically with the sequence length). Even for people/organizations with huge computation power it’s overwhelming.
A RECENTLY PUBLISHED PAPER, “AN IMAGE IS WORTH 16X16 WORDS”, HAS SHOWN HOW TO USE TRANSFORMERS FOR CV TASKS. DO CHECK OUT THIS LINK:
At the moment, not everyone can get access to GPT-3. OpenAI wants to ensure that no one misuses it. This has certainly raised some questions in the AI community and is debatable.
WHERE CAN YOU SEE A DEMO? CHECK THIS OUT !
YES!!! Any model so far is miles away from achieving general intelligence. Even the research team behind GPT-3 has clearly asked the media not to create “FAKE BUZZ”: even though this is surely a milestone, it is not general intelligence and it can make errors.
Given the access restrictions, the fact that you cannot train it yourself, and that even if you could it would just be a library implementation like BERT, you are expected to know only the theoretical part if you mention it in your resume.
IN THIS INTERVIEW PREP SERIES WE LOOK AT IMPORTANT INTERVIEW QUESTIONS ASKED FOR DATA SCIENTIST / ML AND DL ROLES.
IN EACH PART WE WILL DISCUSS A FEW ML INTERVIEW QUESTIONS.
1) What is a PDF?
A probability density function (PDF) is a function that describes the probability distribution of a continuous random variable. The value of the PDF at a point is a density, not a probability; the probability of the random variable falling within an interval is the area under the PDF (its integral) over that interval.
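To make the "area under the PDF" idea concrete, here is a minimal sketch that integrates the standard normal PDF over the interval [-1, 1] and compares it with the CDF difference. The helper name interval_probability and the choice of a standard normal are assumptions made only for illustration.

```python
from statistics import NormalDist


def interval_probability(dist, a, b, steps=10_000):
    """Approximate P(a <= X <= b) by integrating the PDF with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * width)
    return total * width


standard_normal = NormalDist(mu=0.0, sigma=1.0)

# probability of landing within one standard deviation of the mean
p_numeric = interval_probability(standard_normal, -1.0, 1.0)
p_exact = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)
print(round(p_numeric, 4), round(p_exact, 4))
```

Both numbers should agree closely, which is the point: the probability comes from the area, not from a single PDF value.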
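2) What is a Confidence Interval?
A confidence interval is a range of values, computed from a sample, that is likely to contain the true value of a parameter (for example the population mean). A 95% confidence interval means that if the sampling were repeated many times, about 95% of the intervals built this way would contain the true parameter. Confidence intervals measure the degree of uncertainty in an estimate obtained by sampling.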
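As a rough sketch, a confidence interval for a mean can be computed from a sample like this. It assumes the normal approximation (for small samples a t-quantile would be more appropriate), and the sample values and the function name mean_confidence_interval are made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    centre = mean(sample)
    standard_error = stdev(sample) / n ** 0.5
    # two-sided quantile of the standard normal, about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


heights_cm = [172.1, 168.4, 175.0, 169.9, 171.3, 174.2, 170.6, 173.5]  # made-up sample
low, high = mean_confidence_interval(heights_cm)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

A higher confidence level or a smaller sample gives a wider interval.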
3) Can KL divergence be used as a distance measure?
No. It is not a metric, as it is not symmetric (KL(P||Q) is in general different from KL(Q||P)) and it does not satisfy the triangle inequality.
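The asymmetry is easy to check numerically. The sketch below uses two made-up discrete distributions p and q; the function name kl_divergence is just an illustrative helper.

```python
from math import log


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes (natural log)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]
q = [0.5, 0.5]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)  # the two directions give different values
```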
The Spearman rank correlation coefficient is obtained by applying the Pearson correlation coefficient to the rank-encoded variables.
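A minimal sketch of that statement: rank both variables, then apply Pearson to the ranks. The helper names and the toy data (y = x cubed, monotonic but not linear) are assumptions for illustration.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    # Spearman = Pearson applied to the ranks
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear
print(pearson(x, y), spearman(x, y))
```

Pearson is below 1 here because the relation is not linear, while Spearman only cares about the ordering.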
6) Why is “Naive” Bayes naive?
Naive Bayes assumes that the features are conditionally independent of each other given the class, which rarely holds in practice. The conditional independence assumption is made to simplify the computation of the conditional probabilities. Naive Bayes is naive due to this assumption.
7) What is the “crowding problem” in t-SNE?
This happens when the datapoints are distributed in a region on a high-dimensional manifold around a datapoint i, and we try to model the pairwise distances from i to the other datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints will have to be placed too far away in the two-dimensional map.
8) What are the limitations of PCA?
PCA should be used mainly for variables which are strongly correlated.
If the relationship between variables is weak, PCA does not work well to reduce the data. Refer to the correlation matrix to determine this.
PCA results are difficult to interpret clearly.
9) Name 2 failure cases of KNN.
When the query point is an outlier (far from every class), or when the data is extremely random (the classes are jumbled) and carries no useful information.
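10) Name 4 assumptions of linear regression.
A linear relationship between the features and the target
Homoscedasticity (constant variance of the errors)
No autocorrelation (independent errors)
No or little multicollinearity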
11) Why are log probabilities used in Naive Bayes?
The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities, so that the product becomes a sum of logs, to avoid this underflow.
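The underflow is easy to reproduce. In this sketch the 100 per-feature likelihoods of 1e-5 are hypothetical numbers chosen only to trigger the problem.

```python
from math import log

likelihoods = [1e-5] * 100  # hypothetical per-feature likelihoods for one class

product = 1.0
for value in likelihoods:
    product *= value  # the true result (1e-500) is too small for a float

# summing logs keeps the score in a comfortable numeric range
log_score = sum(log(value) for value in likelihoods)
print(product, log_score)
```

The raw product collapses to 0.0, so every class would tie, while the log score is still usable for picking the class with the highest value.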
12) How to handle numerical features (Gaussian NB)?
Numerical features are assumed to follow a Gaussian distribution within each class. The mean and variance of each feature are estimated separately from the data points of each class, and the class-conditional likelihood is read from the resulting Gaussian density.
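Here is a small from-scratch sketch of that idea, not any library’s implementation. The data, the function names and the tiny constant added to the variance are all assumptions made for illustration.

```python
from math import log, pi
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """Estimate a per-class prior plus a per-feature mean and variance."""
    model = {}
    for label in set(labels):
        class_rows = [row for row, y in zip(rows, labels) if y == label]
        columns = list(zip(*class_rows))
        model[label] = {
            "log_prior": log(len(class_rows) / len(rows)),
            "means": [mean(col) for col in columns],
            # small constant keeps the variance strictly positive (assumption of this sketch)
            "variances": [pvariance(col) + 1e-9 for col in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log of the Gaussian density with mean mu and variance var at x."""
    return -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, row):
    scores = {}
    for label, params in model.items():
        log_likelihood = sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["variances"])
        )
        scores[label] = params["log_prior"] + log_likelihood
    return max(scores, key=scores.get)


rows = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]  # made-up data
labels = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(rows, labels)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))
```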
13) How do you get feature importance in Naive Bayes?
Naive Bayes classifiers don’t offer an intrinsic method to evaluate feature importance. Naive Bayes methods work by determining the conditional and unconditional probabilities associated with the features and predicting the class with the highest probability.
14) Differentiate between GD and SGD.
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
In SGD only one data point (in practice often a small mini-batch) is used per iteration to calculate the gradient of the loss function, while for GD all the data points are used.
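The difference shows up in just one line of the update loop. Below is a minimal sketch on made-up data that follows y = 3x; the learning rate, step count and seed are arbitrary assumptions.

```python
import random

# made-up data that follows y = 3x exactly, so the best weight is 3.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def gradient(w, x, y):
    """Derivative of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


def gradient_descent(steps=200, lr=0.01):
    w = 0.0
    for _ in range(steps):
        # GD: average the gradient over every data point before each update
        g = sum(gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * g
    return w


def stochastic_gradient_descent(steps=200, lr=0.01, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # SGD: use the gradient of a single randomly chosen data point
        i = rng.randrange(len(xs))
        w -= lr * gradient(w, xs[i], ys[i])
    return w


w_gd = gradient_descent()
w_sgd = stochastic_gradient_descent()
print(w_gd, w_sgd)
```

Each SGD step is cheaper but noisier; on this toy problem both end up near 3.0.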
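15) Do you know the train and run time complexities for an SVM model?
Train time complexity: roughly O(n^2) for a kernel SVM (it can approach O(n^3)), where n = number of training points
Run time complexity: O(k*d)
k = number of support vectors, d = dimensionality of the data set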
16) Why is RBF kernel SVM compared to kNN?
They are not that similar, but they are related. The point is that both kNN and RBF are non-parametric methods to estimate the probability density of your data.
Notice that these two algorithms approach the same problem differently: kernel methods fix the size of the neighborhood (h) and then calculate K, whereas kNN fixes the number of points, K, and then determines the region in space which contains those points.
17) What decides overfitting and underfitting in DT?
The max_depth parameter mainly decides overfitting and underfitting in decision trees: a very deep tree tends to overfit, while a very shallow tree tends to underfit.
18) What is Non-negative Matrix Factorization?
Decomposing a non-negative matrix into 2 smaller matrices whose elements are all non-negative (greater than or equal to zero) and whose product approximately gives us the original matrix.
19) What is the Netflix Prize problem?
The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.
20) What are word embeddings?
Word embeddings are a type of word representation in a vector space that allows words with similar meaning to have a similar representation.
Understanding word embeddings, semantics, and Word2vec model methods like skip-gram and CBOW.
Remember, machines can only operate on numbers. Approaches like bag of words and tf-idf provide a solution for converting texts to numbers, but not without many drawbacks. Here are a few:
As the dimensionality (vocab size) increases, it becomes memory inefficient.
Since all the word representations are orthogonal, there is no semantic relation between words. That is, the amount of separation between words like “apple” and “orange” is the same as between “apple” and “door”.
Therefore the semantic relation between words, which is really valuable information, is lost. We must come up with techniques which, along with converting texts to numbers, keep space efficiency in mind and also preserve the semantic relation among words.
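The orthogonality problem can be seen in a few lines. This sketch uses a made-up three-word vocabulary with one-hot vectors; the helper names are for illustration only.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


vocab = ["apple", "orange", "door"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# every pair of distinct one-hot vectors is orthogonal, so similarity is always 0
sim_fruit = cosine_similarity(vectors["apple"], vectors["orange"])
sim_door = cosine_similarity(vectors["apple"], vectors["door"])
print(sim_fruit, sim_door)
```

“apple” looks exactly as unrelated to “orange” as it does to “door”, and the vector length grows with the vocabulary.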
Before understanding how word2vec does the job we discussed above, let’s see what it is actually composed of.
Word2vec is a 2-layer neural network that helps to generate word embeddings of a given dimension from a given text corpus.
So essentially word2vec helps us create a numerical mapping of words in a vector space. Remember, this dimensionality can be less than the vocab size.
So what does the word2vec model try to do?
It tries to make similar embeddings for words that occur in similar contexts.
It tries to achieve the above using 2 approaches (algorithms):
CBOW (continuous bag of words)
Skip-gram
CBOW and SKIP-GRAM
CBOW: A MODEL THAT TRIES TO PREDICT THE TARGET WORD FROM THE GIVEN CONTEXT WORDS
SKIP-GRAM: A MODEL THAT TRIES TO PREDICT THE CONTEXT WORDS FROM THE GIVEN TARGET WORD
Here is how we define the loss function (we maximize these probabilities, i.e. minimize their negative log): $p(w_{t+j} \mid w_t)$ for skip-gram; $p(w_t \mid w_{t+j})$ for CBOW, where $w_t$ is the target word and $w_{t+j}$ is a context word inside the window.
Let’s break it down. Suppose we define “context” to be a window of m words. We iterate over all such windows from the beginning of our sentence and try to maximize the conditional probability of a certain word occurring given a context word, or vice versa.
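The window iteration can be sketched as follows. Here the window parameter counts words on each side of the target, which is an assumption of this sketch, and the function names and the sentence are made up.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: the model predicts each context word from the target."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs


def cbow_examples(tokens, window):
    """(context words, target) examples: the model predicts the target from its context."""
    examples = []
    for t, target in enumerate(tokens):
        context = [
            tokens[t + j]
            for j in range(-window, window + 1)
            if j != 0 and 0 <= t + j < len(tokens)
        ]
        examples.append((context, target))
    return examples


sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

The two functions walk over the same windows; they only differ in which side is the input and which side is the prediction.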
LET’S SAY THAT YOUR VOCAB SIZE IS V.
WE PERFORM ONE-HOT ENCODING FOR ALL WORDS.
WE CHOOSE OUR WINDOW SIZE AS C.
WE DECIDE THE EMBEDDING DIMENSION, LET’S SAY N. OUR HIDDEN LAYER WILL HAVE N NEURONS.
THE OUTPUT IS A SOFTMAX LAYER OF DIMENSION V.
WITH THESE THINGS WE CREATE THE BELOW ARCHITECTURE.
WE PERFORM THE TRAINING AND UPDATE THE WEIGHTS. AFTER TRAINING, THE LEARNED WEIGHT MATRIX IS OUR WORD EMBEDDING MATRIX (THE INPUT-TO-HIDDEN WEIGHTS ARE THE USUAL CHOICE; THE WEIGHTS BETWEEN THE HIDDEN LAYER AND THE OUTPUT SOFTMAX HOLD A SECOND SET OF WORD VECTORS). NOTE ITS DIMENSION LINKS THE VOCAB SIZE V AND OUR SELECTED LOWER DIMENSION N (V*N ON THE INPUT SIDE, N*V ON THE OUTPUT SIDE). HENCE IT CONTAINS THE N-DIMENSIONAL VECTOR SPACE REPRESENTATION OF ALL THE WORDS IN THE VOCABULARY.
THE WORKING OF SKIP-GRAM IS JUST THE OPPOSITE. THERE WE USE ONE WORD TO PREDICT THE CONTEXT WORDS.
NOW THE FINAL STEP: TO OBTAIN THE WORD VECTOR OF ANY WORD, WE TAKE THE EMBEDDING MATRIX THAT WE TRAINED AND MULTIPLY IT BY THE WORD’S ONE-HOT ENCODING.
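Multiplying by a one-hot vector simply picks out one row of the matrix. The sketch below assumes a V x N layout (one row per word) and uses a hypothetical 4-word vocabulary with made-up 2-dimensional embeddings.

```python
# hypothetical embedding matrix: V = 4 words (rows), N = 2 dimensions (columns)
vocab = ["king", "queen", "apple", "door"]
embedding_matrix = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.0, 0.3],
]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Row vector (length V) times a V x N matrix gives a length-N vector."""
    columns = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(columns)
    ]


index = vocab.index("apple")
via_product = vector_times_matrix(one_hot(index, len(vocab)), embedding_matrix)
via_lookup = embedding_matrix[index]
print(via_product, via_lookup)
```

This is why embedding layers in practice are implemented as a plain index lookup instead of a full matrix product.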
ITS ALWAYS HELPFUL TO REMEMBER THE LOSS FUNCTION MENTIONED ABOVE IN INTERVIEWS!
HOW BATCH NORMALISATION SPEEDS UP TRAINING, HELPS IN SCALING AND ALSO ACTS AS A REGULARIZER IN NEURAL NETWORKS
We know that normalization and feature scaling help in achieving faster training and convergence. But why is that the case? Normalization makes our cost function more symmetric in all variables, hence we do not have to worry about a certain learning rate being too small in certain directions and too large in others.
Hence we can use sufficiently large learning rates to converge faster. THERE ARE ADAPTIVE LEARNING RATE OPTIMIZERS LIKE ADAM, BUT NORMALIZATION STILL HELPS. Also, we know that in neural networks data points are passed in batches.
The idea behind batch normalization is to normalize the activations in every layer w.r.t. the current batch of data.
WHY BATCH NORMALISATION?
SPEEDS UP TRAINING.
REDUCES THE IMPORTANCE OF WEIGHT INITIALISATION (BECAUSE THE COST FUNCTION IS SMOOTHED OUT)
CAN BE THOUGHT OF AS A REGULARIZER
HOW CAN IT BE ASSOCIATED WITH REGULARIZATION?
In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems.
NOTICE HOW, WHEN NORMALIZING THE ACTIVATIONS, THE MEAN AND SIGMA VARY WITH EACH BATCH THAT PASSES. THIS CREATES SOME “RANDOMNESS”, AS EACH NEW UNSEEN BATCH WILL HAVE DIFFERENT MEAN AND SIGMA VALUES. THIS CAN BE THOUGHT OF AS REGULARIZATION, AS IT HELPS THE MODEL TO NOT OVERFIT TO A CERTAIN DISTRIBUTION.
REMEMBER THAT BATCH NORMALIZATION IS OFTEN USED TOGETHER WITH DROPOUT.
IMPLEMENTATION OF BATCH NORMALIZATION IN KERAS
Since the mean and variance can vary greatly from batch to batch, there needs to be some calibration over these two quantities. Hence we introduce 2 learnable parameters for the purpose.
THE NUMBER OF LEARNABLE PARAMETERS IN BATCH NORMALIZATION IS 2 (PER NORMALIZED FEATURE), namely gamma and beta (gamma scales the normalized activation, i.e. calibrates the variance; beta shifts it, i.e. calibrates the mean).
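Here is a from-scratch sketch of what a batch normalization layer computes for one feature, written in plain Python and not taken from the Keras source. During training it normalizes with the batch mean and variance; at inference it uses stored moving averages instead. The function names, the eps value, the momentum and the starting moving statistics are assumptions of this sketch.

```python
def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over the current batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    out = [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]
    return out, mean, var


def batch_norm_infer(x, gamma, beta, moving_mean, moving_var, eps=1e-5):
    """At inference, use stored moving statistics instead of batch statistics."""
    return gamma * (x - moving_mean) / (moving_var + eps) ** 0.5 + beta


activations = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one neuron for a batch of 4
out, batch_mean, batch_var = batch_norm_train(activations, gamma=1.0, beta=0.0)
print(out)

# moving averages are nudged towards the batch statistics during training
momentum = 0.99  # assumed value
moving_mean = momentum * 0.0 + (1 - momentum) * batch_mean
moving_var = momentum * 1.0 + (1 - momentum) * batch_var
print(batch_norm_infer(5.0, 1.0, 0.0, moving_mean, moving_var))
```

With gamma = 1 and beta = 0 the output has roughly zero mean and unit variance; other values of gamma and beta rescale and shift it.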
IMPORTANT INTERVIEW QUESTION REGARDING THE BATCH NORMALISATION LAYER:
How does it work differently during training and testing?
Following is the answer from the official documentation page from keras :
How the attention mechanism layer is an upgrade over single-context-vector seq2seq models
I MEAN, WHO DOESN’T CRAVE A LITTLE ATTENTION? IT ONLY HELPS SO MUCH. MACHINE TRANSLATION AND NEURAL NETS HAVE A LONG HISTORY, BUT THE “ATTENTION MECHANISM” IS NOT VERY OLD. APPLYING “ATTENTION”, AN ANALOGY TO HOW HUMANS READ AND PERCEIVE TEXT/SEQUENTIAL INFORMATION, HAS HELPED IN ACHIEVING BETTER RESULTS IN THE FIELD OF MACHINE TRANSLATION.
HERE WE TRY TO UNDERSTAND THE IMPLEMENTATION OF AN ATTENTION LAYER, WHICH IF USED (AS AN IMPORTED PACKAGE/LIBRARY) IS A ONE-LINER, BUT IF IMPLEMENTED FROM SCRATCH REQUIRES SOME WORK.
FOLLOWING IS THE INTUITION OF THE ATTENTION MECHANISM WITHOUT ANY MATH:
HUMANS, WHEN TRYING TO TRANSLATE OR SUMMARIZE ANY SENTENCE, TEND TO READ THE ENTIRE SENTENCE AND KEEP TRACK OF HOW ALL THE WORDS, ANY SARCASM AND ANY REFERENCES ARE RELATED TO EACH OTHER. SO, BASICALLY, HOW THE WORDS TOGETHER ARE RESPONSIBLE FOR THE FINAL TRANSLATION.
I’LL ASSUME THAT THE READER IS ALREADY FAMILIAR WITH WORD EMBEDDINGS, TOKENISATION, DENSE LAYERS AND THE EMBEDDING MATRIX.
SO OUR PROBLEM STATEMENT IS FRENCH TO ENGLISH TRANSLATION.
WE PERFORM TOKENISATION, CREATE THE EMBEDDING MATRIX, AND NOW WE WANT TO ADD AN ATTENTION LAYER IN BETWEEN THE INPUT AND OUTPUT SEQUENCES. I HOPE THE INTUITION IS CLEAR.
INSTEAD OF PASSING ONE CONTEXT VECTOR, WE WANT OUR MODEL TO SEE ALL THE ENCODER STATES AND DECIDE WHICH FEATURES ARE MORE IMPORTANT DURING TRANSLATION AT EACH STAGE IN THE DECODER.
IT’S TIME TO VISUALIZE THE ABOVE STATEMENT. HAVE A LOOK:
IMAGE REF: https://sknadig.dev/basics-attention/
LET’S BREAK IT DOWN. WHAT YOU SEE IS THE LSTMS (BIDIRECTIONAL ENCODER) CREATING THEIR RESPECTIVE STATES, AND INSTEAD OF PASSING ONLY THE LAST STATE WE ARE PASSING WEIGHTED STATES, WEIGHTED BY FACTORS ALPHA (PRODUCED BY A LEARNABLE SCORING FUNCTION), DIFFERENT FOR EACH STATE. THIS WEIGHTED CONTEXT VECTOR AND THE PRESENT DECODER STATE TOGETHER DECIDE THE NEXT STATE AND OUTPUT.
THE ABOVE PARA COMPLETELY SUMMARISES THE INTUITION. NOW LET’S SEE SOME MATH. THE FINAL CONTEXT VECTOR IS THE WEIGHTED SUM OF THE HIDDEN STATES. THE OTHER CONDITION IS THAT THE ALPHAS ARE NORMALISED SO THAT THEY SUM TO ONE. NOW HOW TO DECIDE THE ALPHAS?
WELL, IT TURNS OUT, “WHAT BETTER THAN A NEURAL NETWORK TO APPROXIMATE A FUNCTION”, HENCE:
THIS COMPATIBILITY FUNCTION (WHICH IS A TRAINABLE NETWORK) IS OF DIFFERENT TYPES.
HERE WE DISCUSS THE BAHDANAU ATTENTION MECHANISM (LUONG BEING THE OTHER). LET’S START WITH A LITTLE CODE BEFORE GOING INTO FURTHER DETAILS. SO THE FIRST THING WE NEED IS AN ENCODER,
WHAT WE DID ABOVE WAS JUST MAKE AN ENCODER CLASS AND MAKE ONE OBJECT FROM THAT CLASS (YOU CAN USE GRU/LSTM). INPUT VOCAB SIZE = VOCAB SIZE OBTAINED AFTER TOKENISING YOUR DATA. THE CODE IS REFERENCED FROM THE TENSORFLOW NEURAL MACHINE TRANSLATION TUTORIAL.
NOW IN BETWEEN THE ENCODER AND DECODER WE INTRODUCE AN ATTENTION LAYER. JUST LIKE ABOVE, WE CODE A CLASS AND MAKE AN OBJECT.
THIS IS BASICALLY THE COMPATIBILITY FUNCTION, AND THE “NUMBER OF ATTENTION UNITS” DECIDES THE NUMBER OF UNITS IN THE DENSE LAYERS (DEPENDING ON THE SCORING FUNCTION USED) THAT YOU WILL INITIALIZE FOR TRAINING. HENCE IT’S A HYPERPARAMETER.
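import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        # Dense layers used by call() below; "units" is the number of
        # attention units (layer layout assumed from the usage in call())
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query hidden state shape == (batch_size, hidden size)
        # query_with_time_axis shape == (batch_size, 1, hidden size)
        # values shape == (batch_size, max_len, hidden size)
        # we are doing this to broadcast addition along the time axis to calculate the score
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        # softmax over the time axis so the weights sum to one
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights


attention_layer = BahdanauAttention(10)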
THE “SCORE” FUNCTION USED IN THE CALL FUNCTION HERE IS “CONCAT”. IN TOTAL THERE ARE 3 VARIETIES OF SCORING FUNCTIONS THAT ARE USED.
DEPENDING ON THE SCORING FUNCTION, WE INITIALIZE OUR PARAMETERS IN THE ATTENTION CLASS. FOR THE DIFFERENT SCORING FUNCTIONS REFER TO THIS.
NOW WE WILL DEFINE OUR DECODER CLASS. NOTICE HOW WE USE THE ATTENTION OBJECT WITHIN THE DECODER CLASS. THIS ATTENTION TAKES INPUT FROM THE ENCODER STATES, PERFORMS THE “ATTENTION MECHANISM” OPERATION, AND THEN WE DO THE “DECODING” PART. IT RETURNS THE ATTENTION WEIGHTS AND THE OUTPUT STATE.
WHERE vocab_tar_size IS THE VOCAB SIZE AFTER TOKENISATION OF THE TARGET LANGUAGE.
SO THE FINAL PICTURE IS SOMEWHAT LIKE (DOT BASED ATTENTION):
The e’s (the scores) are normalised to give the alphas, the alphas perform the weighting operation, and the weighted context together with the present decoder state gives us the final results.
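The step from scores to alphas to context vector can be sketched with plain numbers. The encoder states and the raw scores below are made up; in a real model the scores come from the trained scoring function.

```python
from math import exp


def softmax(scores):
    # subtracting the maximum keeps exp() numerically stable
    shifted = [exp(s - max(scores)) for s in scores]
    total = sum(shifted)
    return [value / total for value in shifted]


def context_vector(encoder_states, scores):
    """Weight each encoder state by its alpha and sum them up."""
    alphas = softmax(scores)  # the alphas are non-negative and sum to one
    size = len(encoder_states[0])
    context = [
        sum(a * state[d] for a, state in zip(alphas, encoder_states))
        for d in range(size)
    ]
    return context, alphas


# made-up encoder hidden states (3 time steps, hidden size 2) and raw scores e
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.5, 0.1]

context, alphas = context_vector(encoder_states, scores)
print(alphas, context)
```

The state with the highest score gets the largest alpha and therefore contributes most to the context vector passed to the decoder.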
HERE I HAVE DISCUSSED THE VISUAL REPRESENTATION OF THE ENCODER -> ATTENTION -> DECODER PART AND ITS MATHEMATICS.
FURTHER, WHAT WE SEE ABOVE IS GLOBAL ATTENTION. ANOTHER APPROACH IS LOCAL ATTENTION, WHERE INSTEAD OF LOOKING AT THE ENTIRE SENTENCE WE MIGHT BE INTERESTED IN A WINDOW OF WORDS. NOW AGAIN, THAT WOULD ADD A HYPERPARAMETER 😛.
FURTHER STEPS ARE DEFINING THE OPTIMIZER AND LOSS FUNCTIONS, AND USING A METHOD CALLED “TEACHER FORCING” TO TRAIN THE MODEL. FOR FURTHER READING REFER TO: https://www.tensorflow.org/tutorials/text/nmt_with_attention