
The Fault In “GPT-3”. IS IT HYPE?

IS GPT-3 HYPE? WHAT THE MEDIA SAYS, AND IS IT IMPORTANT FOR INTERVIEWS?

If you are familiar with the concept of attention and transformers, you must have come across the word “GPT”, be it the earlier models like GPT-1 and GPT-2 or the recently released GPT-3.

GPTs (generative pre-trained models) are decoder-only transformer stacks developed by OpenAI.

Ever since GPT-3 was released, platforms like Twitter have been flooded with posts glorifying the model and what it can do. The posts were written in a manner that would make any layperson perceive it as some sort of magic. Funny claims like “this is the end of software engineering” were made.

GPT-3 is indeed a milestone in NLP, as it showed performance like never before. But one needs to understand the limitations and the reasons for such performance. In the end, one can see that GPT-3 is far from deserving the label “near human intelligence”.

CHANGES FROM GPT-1 AND GPT-2

Below you can see the architecture of the GPT-1 model (one transformer decoder).

Further enhancements by varying layers and parameters led to GPT-2.

ref: https://jalammar.github.io/illustrated-gpt2/

GPT-3 is structurally similar to GPT-2. The main advancements are the result of the extremely large number of parameters used in training the model, along with computing resources far beyond what any “normal” research group can afford.

GPT-2 VS GPT-3 PARAMETER COMPARISON

| MODEL | NUMBER OF PARAMETERS | NUMBER OF LAYERS | BATCH SIZE |
|---|---|---|---|
| GPT-2 | 1.5 B | 48 | 512 |
| GPT-3 Small | 125 M | 12 | 0.5 M |
| GPT-3 Medium | 350 M | 24 | 0.5 M |
| GPT-3 Large | 760 M | 24 | 0.5 M |
| GPT-3 6.7B | 6.7 B | 32 | 2 M |
| GPT-3 13B | 13.0 B | 40 | 2 M |
| GPT-3 175B or “GPT-3” | 175.0 B | 96 | 3.2 M |
All GPT-3 models were trained on a total of 300 billion tokens. Dataset: Common Crawl.

THE MAJORITY OF THE PERFORMANCE GAINS COME FROM THE ENORMOUS NUMBER OF PARAMETERS.

IS IT SCALABLE?

Well, if you are thinking of training a GPT-3 model from scratch, you might need to think twice. Even for OpenAI, the cost of training GPT-3 was close to $4.6 million. And at present computing costs, training a GPT-4 or GPT-8 might be too expensive even for such huge organizations.

THE NEGATIVE BIAS

Given that GPT-3 was trained on Common Crawl data from the internet, the model was prone to “learning” the social bias against women and black people and the hate speech that is present in abundance on the internet. It’s not surprising these days to find two people cussing and fighting on any social media platform, sadly.

ALWAYS RIGHT?

GPT-3 fails at tasks which are very problem specific. You can expect it to understand and answer common daily-life questions (even then there is no guarantee of 100% accuracy), but it can’t answer very specific medical case questions. Also, there is no “fact checking mechanism” to ensure that the output is not only semantically correct but also factually correct.

GPT FOR VISION ?

A direct implementation of transformers isn’t feasible considering the dimensionality of an image and the training time complexity of a transformer. Even for people/organizations with huge computational power, it’s overwhelming.

A RECENTLY PUBLISHED PAPER, “AN IMAGE IS WORTH 16X16 WORDS”, SHOWS HOW TO USE TRANSFORMERS FOR CV TASKS. DO CHECK OUT THIS LINK:

ACCESS RIGHTS

At the moment, not everyone can get access to it. OpenAI wants to ensure that no one misuses it. This has certainly raised some questions in the AI community and is debatable.

WHERE CAN YOU SEE A DEMO? CHECK THIS OUT!

MEDIA HYPE?

YES!!! Any model till now is just miles away from achieving general intelligence. Even the GPT-3 research team has clearly asked the media not to create a “FAKE BUZZ”: even though this is a milestone for sure, it is not general intelligence and can make errors.

FOR INTERVIEWS?

Given the access rights, the fact that you cannot train it, and that even if you could it would just be a library implementation like BERT, you are expected to know only the theoretical part if you mention it in your resume.

😛

LINK TO GPT-3 RESEARCH PAPER : https://arxiv.org/pdf/2005.14165.pdf


ML INTERVIEW QUESTIONS - PART 1

IN THIS INTERVIEW PREP SERIES WE LOOK AT IMPORTANT INTERVIEW QUESTIONS ASKED IN DATA SCIENTIST / ML / DL ROLES.

IN EACH PART WE WILL DISCUSS A FEW ML INTERVIEW QUESTIONS.

1) What is a PDF?

A probability density function (PDF) is a statistical expression that defines a probability distribution (the likelihood of an outcome) for a continuous random variable. Integrating the PDF over an interval gives the probability of the random variable falling within that interval.
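In symbols, for a random variable X with density f_X:

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1$$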

2) What is Confidence Interval?

A confidence interval displays the probability that a parameter will fall between a pair of values around the mean. Confidence intervals measure the degree of uncertainty or certainty in a sampling method.
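For example, a 95% confidence interval for a mean, assuming a known (or large-sample) standard deviation, takes the familiar form:

$$\bar{x} \pm z_{0.975}\,\frac{\sigma}{\sqrt{n}} \approx \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$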

3) Can KL divergence be used as a distance measure?

No. It is not a metric, as it is not symmetric (and it does not satisfy the triangle inequality either).
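A quick numerical sketch of the asymmetry, using two small discrete distributions:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q) = sum p * log(p / q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))  # D(p || q)
print(kl(q, p))  # D(q || p) -- a different value, so KL is not symmetric
```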

4) What is Log-normal distribution? 

In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.

5) What is Spearman Rank Correlation Coefficient?

The Spearman rank correlation coefficient is obtained by applying the Pearson correlation coefficient to the rank-encoded random variables.
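A small sketch verifying this with SciPy (assuming scipy is installed; the data is a made-up toy example):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

x = np.array([10, 20, 30, 40, 1000])  # monotonic but nonlinear scale
y = np.array([1, 2, 3, 5, 4])

rho_direct = spearmanr(x, y)[0]                       # Spearman directly
rho_via_ranks = pearsonr(rankdata(x), rankdata(y))[0]  # Pearson on the ranks

print(rho_direct, rho_via_ranks)  # the two values match
```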

6) Why is “Naive” Bayes naive?

The conditional independence of the variables of a data frame is an assumption in Naive Bayes which can never be true in practice. The conditional independence assumption is made to simplify the computations of the conditional probabilities. Naive Bayes is naive due to this assumption.

7) What is the “crowding problem” in t-SNE?

This happens when datapoints are distributed in a region of a high-dimensional manifold around a point i, and we try to model the pairwise distances from i to those datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold, but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances are modeled accurately in the map, most of the moderately distant datapoints end up too far away in the two-dimensional map.

8) What are the limitations of PCA?

PCA should be used mainly for variables which are strongly correlated.

If the relationships between variables are weak, PCA does not work well to reduce the data. Refer to the correlation matrix to determine this.

PCA results are difficult to interpret clearly, since each principal component is a linear combination of the original variables.

9) Name 2 failure cases of KNN.

When the query point is an outlier, or when the data is extremely random and carries no useful information.

10) Name the key assumptions of linear regression

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

11) Why are log probabilities used in Naive Bayes?

The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
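A tiny sketch of the underflow issue and the log-space fix:

```python
import numpy as np

# 1000 small conditional probabilities, as might arise in Naive Bayes
probs = np.full(1000, 1e-5)

naive_product = np.prod(probs)     # underflows to 0.0
log_score = np.sum(np.log(probs))  # stays a finite, comparable number

print(naive_product)  # 0.0
print(log_score)      # about -11512.9
```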

12) How to handle numerical features (Gaussian NB)?

Numerical features are assumed to be Gaussian. Probabilities are determined by considering the distribution of the data points belonging to each class separately, i.e., a per-class mean and variance are estimated for each feature.
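Concretely, for a numerical feature x_i and class y, Gaussian NB models the class-conditional likelihood as:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\, \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

where μ_y and σ_y² are the mean and variance of x_i estimated from the training points of class y.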

13) How do you get feature importances in Naive Bayes?

The Naive Bayes classifiers don’t offer an intrinsic method to evaluate feature importances. Naive Bayes methods work by determining the conditional and unconditional probabilities associated with the features and predicting the class with the highest probability.

14) Differentiate between GD and SGD.

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

In SGD, only one data point is used per iteration to calculate the value of the loss function, while in GD all the data points are used.
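An illustrative sketch contrasting the two update rules on a hypothetical toy problem (linear regression with squared loss); all names and numbers here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

lr = 0.01

# GD: one update per epoch, gradient computed over the full dataset
w_gd = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= lr * grad

# SGD: one update per data point, gradient computed from a single example
w_sgd = np.zeros(3)
for _ in range(20):
    for i in rng.permutation(len(y)):
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= lr * grad_i

print(w_gd, w_sgd)  # both land close to the true weights [1.0, -2.0, 0.5]
```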

15) Do you know the train and run time complexities for an SVM model?

Train time complexity: O(n²)

Run time complexity: O(k·d)

where k = number of support vectors and d = dimensionality of the data set.

16) Why is RBF kernel SVM compared to kNN?

They are not that similar, but they are related. The point is that both kNN and the RBF kernel are non-parametric methods to estimate the probability density of your data.

Notice that these two algorithms approach the same problem differently: kernel methods fix the size of the neighborhood (h) and then calculate K, whereas kNN fixes the number of points, K, and then determines the region in space which contains those points.

RBF KERNEL
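For reference, the RBF (Gaussian) kernel between two points x and x′ is:

$$K(x, x') = \exp\!\left(-\gamma\,\lVert x - x'\rVert^2\right) = \exp\!\left(-\frac{\lVert x - x'\rVert^2}{2\sigma^2}\right)$$

where γ (equivalently, the bandwidth σ) plays the role of the neighborhood size h mentioned above.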

17) What decides overfitting and underfitting in DT?

The depth of the tree (e.g., the max_depth hyperparameter) decides overfitting and underfitting in decision trees: a very deep tree tends to overfit, while a very shallow tree tends to underfit.

18) What is Non-negative Matrix Factorization?

Decomposing a matrix into two smaller matrices, with all elements greater than or equal to zero, whose product approximates the original matrix.
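In symbols, for a non-negative matrix V of shape m × n and a chosen rank k:

$$V_{m \times n} \approx W_{m \times k}\, H_{k \times n}, \qquad W \ge 0,\; H \ge 0,\; k < \min(m, n)$$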

19) What is the Netflix Prize problem?

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.

20) What are word embeddings?

Word embeddings are a type of word representation IN A VECTOR SPACE that allows words with similar meaning to have a similar representation.


THE WORD2VEC MODEL (CBOW /SKIP-GRAM)

Understanding word embeddings, semantics, and Word2vec model methods like skip-gram and CBOW.

WHY EMBEDDINGS?

Remember, machines can only operate on numbers. Approaches like bag of words and TF-IDF provide a solution for converting text to numbers, but not without many drawbacks. Here are a few:

  1. As the dimensionality (vocab size) increases, it becomes memory inefficient.
  2. Since all the word representations are orthogonal, there is no semantic relation between words. That is, the amount of separation between words like “apple” and “orange” is the same as between “apple” and “door” (see the short numerical sketch below).

Therefore the semantic relation between words, which is really valuable information, is lost. We must come up with techniques which, along with converting text to numbers, keep space efficiency in mind and also preserve the semantic relations among words.
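A minimal sketch of the orthogonality problem mentioned above (hypothetical three-word vocabulary): with one-hot vectors, every pair of distinct words has exactly the same similarity, namely zero.

```python
import numpy as np

# Hypothetical 3-word vocabulary: apple, orange, door
apple  = np.array([1, 0, 0])
orange = np.array([0, 1, 0])
door   = np.array([0, 0, 1])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(apple, orange))  # 0.0
print(cosine(apple, door))    # 0.0 -- no notion of "apple is closer to orange"
```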

EMBEDDINGS ARE USED IN MANY DEEP LEARNING NLP TASKS LIKE SEQ2SEQ MODELS, TRANSFORMERS etc.

WORD2VEC

Before understanding how it does the job we discussed above, let’s see what it is actually composed of.

Word2vec is a two-layer neural network that helps generate word embeddings of a given dimension from a given text corpus.

So essentially, word2vec helps us create a numerical mapping of words in a vector space. Remember, this dimensionality can be less than the vocab size.

So what does the word2vec model try to do?

It tries to make similar embeddings for words that occur in similar contexts.

It tries to achieve this using two algorithms:

  1. CBOW (continuous bag of words)
  2. Skip-gram

CBOW and SKIP-GRAM

CBOW: A MODEL THAT TRIES TO PREDICT THE TARGET WORD FROM THE GIVEN CONTEXT WORDS

CBOW ALGORITHM REPRESENTATION

SKIPGRAM: A MODEL THAT TRIES TO PREDICT THE CONTEXT WORDS FROM THE GIVEN TARGET WORD

Here is how we define the loss function: it is built from $p(w_{t+j} \mid w_t)$ for skip-gram and $p(w_t \mid w_{t+j})$ for CBOW:

Let’s break it down. Suppose we define the “context” to be a window of m words around the target. We iterate over all such windows from the beginning of our sentence and try to maximize the conditional probability of the context words given the target word (skip-gram), or vice versa (CBOW).
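Written out for skip-gram (CBOW just swaps the roles of target and context), the training objective over a corpus of T words with window size m is:

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\!\left(u_{w_O}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left(u_w^{\top} v_{w_I}\right)}$$

where $v_w$ and $u_w$ are the input and output vectors of word w and V is the vocabulary size.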

THE WORKING

  1. LET'S SAY THAT YOUR VOCAB SIZE IS V.
  2. WE PERFORM ONE-HOT ENCODING FOR ALL WORDS.
  3. WE CHOOSE OUR WINDOW SIZE AS C.
  4. WE DECIDE THE DIMENSION OF THE EMBEDDING, SAY N. OUR HIDDEN LAYER WILL HAVE N NEURONS.
  5. THE OUTPUT IS A SOFTMAX LAYER OF DIMENSION V.
  6. WITH THESE THINGS WE CREATE THE BELOW ARCHITECTURE.
CBOW WORD2VEC ARCHITECTURE

WE PERFORM THE TRAINING AND UPDATE THE WEIGHTS. AFTER TRAINING, THE WEIGHT MATRIX BETWEEN THE HIDDEN LAYER AND THE OUTPUT SOFTMAX IS OUR WORD EMBEDDING MATRIX. NOTE THAT ITS DIMENSION IS (OUR SELECTED LOWER DIMENSION N) × (VOCAB SIZE). HENCE IT CONTAINS THE N-DIMENSIONAL VECTOR SPACE REPRESENTATION OF ALL THE WORDS IN THE VOCABULARY.

THE WORKING OF SKIP-GRAM IS JUST THE OPPOSITE: THERE WE USE ONE WORD TO PREDICT THE CONTEXT WORDS.

SKIP-GRAM WORD2VEC ARCHITECTURE

NOW THE FINAL STEP: TO OBTAIN THE WORD VECTOR OF ANY WORD, WE TAKE THE EMBEDDING MATRIX THAT WE TRAINED AND MULTIPLY IT BY THE WORD'S ONE-HOT ENCODED REPRESENTATION.
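A minimal numpy sketch of that lookup step (hypothetical vocab size V = 5 and embedding dimension N = 3): multiplying the trained embedding matrix by a one-hot vector simply selects the corresponding column.

```python
import numpy as np

V, N = 5, 3                          # hypothetical vocab size and embedding dimension
rng = np.random.default_rng(42)
embedding = rng.normal(size=(N, V))  # stand-in for the trained embedding matrix, shape N x V

word_index = 2                       # index of the word in the vocabulary
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

word_vector = embedding @ one_hot    # same as embedding[:, word_index]
print(word_vector)
```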

THAT'S IT!

IT'S ALWAYS HELPFUL TO REMEMBER THE LOSS FUNCTION MENTIONED ABOVE IN INTERVIEWS!
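In practice you rarely build word2vec from scratch. Here is a hedged sketch using the gensim library (assuming gensim 4.x is installed; the tiny corpus below is purely illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (purely illustrative)
sentences = [
    ["i", "like", "apples"],
    ["i", "like", "oranges"],
    ["the", "door", "is", "open"],
]

# sg=0 -> CBOW, sg=1 -> skip-gram; vector_size is the embedding dimension N
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["apples"].shape)              # (50,)
print(model.wv.similarity("apples", "oranges"))
```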


BATCH NORMALISATION IN NEURAL NETWORKS

HOW BATCH NORMALISATION SPEEDS UP TRAINING, HELPS IN SCALING AND ALSO ACTS AS A REGULARIZER IN NEURAL NETWORKS

We know that normalization and feature scaling help in achieving faster training and convergence. But why is that the case? Normalization makes our cost function more symmetric in all variables, hence we do not have to worry about a certain learning rate being too small in certain directions and too overwhelming in others.

Hence we can use sufficiently good learning rates to converge faster. THERE ARE ADAPTIVE LEARNING RATE OPTIMIZERS LIKE ADAM, BUT NORMALIZATION STILL HELPS. Also, we know that in neural networks data points are passed in batches.

The idea behind batch normalization is to normalize all the activations in every layer with respect to the statistics of the current batch.
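Concretely, for a mini-batch B = {x_1, ..., x_m} of a given activation, batch normalization computes:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

where γ and β are the learnable scale and shift parameters discussed further below.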

WHY BATCH NORMALISATION?

  1. SPEEDS UP TRAINING.
  2. REDUCES THE IMPORTANCE OF WEIGHT INITIALISATION (BECAUSE THE COST FUNCTION IS SMOOTHED OUT)
  3. CAN BE THOUGHT OF AS A REGULARIZER

HOW CAN IT BE ASSOCIATED WITH REGULARIZATION?

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems.

WIKIPEDIA

NOTICE HOW, WHEN NORMALIZING THE ACTIVATIONS, THE MEAN AND SIGMA VARY WITH EACH BATCH THAT PASSES. THIS CREATES SOME “RANDOMNESS”, AS EACH NEW UNSEEN BATCH WILL HAVE DIFFERENT MEAN AND SIGMA VALUES. THIS CAN BE THOUGHT OF AS REGULARIZATION, AS IT HELPS THE MODEL NOT TO OVERFIT TO A CERTAIN DISTRIBUTION.

REMEMBER THAT BATCH NORMALIZATION IS OFTEN USED TOGETHER WITH DROPOUT.

IMPLEMENTATION OF BATCH NORMALIZATION IN KERAS

Since the mean and variance can vary greatly from batch to batch, there needs to be some calibration over these two statistics. Hence we introduce two learnable parameters for this purpose.

THE NUMBER OF TUNABLE PARAMETERS (PER FEATURE) IN BATCH NORMALIZATION IS 2, NAMELY GAMMA AND BETA (GAMMA RESCALES THE NORMALIZED ACTIVATIONS, BETA SHIFTS THEM).
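A minimal Keras sketch (assuming TensorFlow 2.x / tf.keras) of a dense layer followed by batch normalization; the BatchNormalization layer adds the gamma/beta parameters described above, plus non-trainable moving mean and variance used at inference time:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(64),
    layers.BatchNormalization(),  # learnable gamma (scale) and beta (shift) per unit
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()  # the BatchNormalization layer shows 128 trainable + 128 non-trainable params
```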

THE CITATION IN THE RESEARCH PAPER SHOWN IN THE VERY BEGINNING

IMPORTANT INTERVIEW QUESTION REGARDING BATCH NORMALISATION LAYER:

How does it work differently during training and testing?

Following is the answer from the official Keras documentation page: