Hey All,

This video is an introduction to Natural Language Processing (NLP). It is the first in a series of videos that will cover NLP, both theory and applications, from scratch.

Video-1


In this video we try to understand what a GAN is actually trying to achieve. GANs have become synonymous with "fake faces", which hides the actual mathematical problem statement; here we try to explore that.

Links to the tool and mentioned youtube video:

https://poloclub.github.io/ganlab/

https://arxiv.org/abs/1406.2661

IS GPT-3 A HYPE? WHAT MEDIA SAYS AND IS IT IMPORTANT FOR INTERVIEWS?

If you are familiar with the concepts of attention and transformers, you must have come across the term "GPT", be it the earlier models GPT-1 and GPT-2 or the recently released GPT-3.

GPTs (generative pre-trained models) are decoder-only stacks developed by OpenAI.

Ever since GPT-3 was released, platforms like Twitter have been flooded with posts glorifying the model and what it can do. The posts were written in a manner that would make any layperson perceive it as some sort of magic. Funny claims like "this is the end of software engineering" were made.

GPT-3 is in fact a milestone in NLP, as it showed performance like never before. But one needs to understand its limitations and the reasons for such performance. In the end, one can see that GPT-3 is far from being labelled "near to human intelligence".

Below you can see the architecture of the GPT-1 model (one transformer decoder).

**Further enhancements by varying layers and parameters led to GPT-2**

GPT-3 is structurally similar to GPT-2. The main advancements are the result of the extremely large number of parameters used in the model. Also, the computing resources used were far more than any "normal" research group can afford.

| Model | Number of parameters | Number of layers | Batch size |
|---|---|---|---|
| GPT-2 | 1.5B | 48 | 512 |
| GPT-3 Small | 125M | 12 | 0.5M |
| GPT-3 Medium | 350M | 24 | 0.5M |
| GPT-3 Large | 760M | 24 | 0.5M |
| GPT-3 6.7B | 6.7B | 32 | 2M |
| GPT-3 13B | 13.0B | 40 | 2M |
| GPT-3 175B or "GPT-3" | 175.0B | 96 | 3.2M |

MAJORITY OF THE PERFORMANCE BENEFITS CAN BE SEEN COMING FROM THE ENORMOUSLY HUGE NUMBER OF PARAMETERS .

Well, if you are thinking of training a GPT-3 model from scratch, you might want to think twice. Even for OpenAI, the cost of training GPT-3 was estimated to be close to **$4.6 million**. And at present computing costs, training a GPT-4 or GPT-8 might be too expensive even for such huge organizations.

Given that GPT-3 was trained on Common Crawl data from the internet, the model was prone to "learning" social biases against women and Black people, as well as the hateful comments that are present in abundance on the internet. It's not surprising these days to find two people cussing and fighting on any social media platform, sad.

GPT-3 fails at tasks that are very problem-specific. You can expect it to understand and answer common daily-life questions (even then there is no guarantee of one hundred percent accuracy), but it can't answer very specific medical case questions. Also, there is no "fact-checking mechanism" that can ensure the output is not only semantically correct but also factually correct.

Directly applying transformers to images isn't feasible considering the dimensionality of an image and the training time complexity of a transformer. Even for people/organizations with huge computational power, it's overwhelming.

A RECENTLY PUBLISHED PAPER, "AN IMAGE IS WORTH 16X16 WORDS", HAS SHOWN HOW TO USE TRANSFORMERS FOR CV TASKS. DO CHECK OUT THIS LINK:

At the moment, not everyone **can** get **access** to GPT-3. OpenAI wants to ensure that no one misuses it. This has certainly raised some questions in the AI community and is debatable.

YES!!! Any model so far is still miles away from achieving general intelligence. Even the GPT-3 research team has clearly asked the media not to create a "FAKE BUZZ": even though this is certainly a milestone, it is not general intelligence and it can make errors.

Given the access restrictions, the fact that you cannot train it, and that even if you could it would just be a library implementation like BERT, you are only expected to know the theoretical part if you mention it in your resume.

😛

LINK TO GPT-3 RESEARCH PAPER : https://arxiv.org/pdf/2005.14165.pdf

IN THIS INTERVIEW PREP SERIES WE LOOK AT IMPORTANT INTERVIEW QUESTIONS ASKED FOR DATA SCIENTIST / ML AND DL ROLES.

IN EACH PART WE WILL DISCUSS A FEW ML INTERVIEW QUESTIONS.

A probability density function (PDF) is a statistical expression that defines a probability distribution (the relative likelihood of an outcome) for a continuous random variable. Integrating the PDF over an interval gives the probability of the random variable falling within that interval.
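To make the "PDF over an interval" idea concrete, here is a minimal sketch that numerically integrates a standard normal density. The function names `normal_pdf` and `interval_probability` are made up for this illustration, and the trapezoidal rule is just one simple way to approximate the integral.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_probability(pdf, a, b, steps=10_000):
    """Approximate P(a <= X <= b) by integrating the PDF with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * h)
    return total * h

# Probability that a standard normal variable lies within one sigma of the mean.
p_one_sigma = interval_probability(normal_pdf, -1.0, 1.0)
print(round(p_one_sigma, 4))  # roughly 0.6827
```

Note that the density value at a single point is not a probability; only the area under the curve over an interval is.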

A **confidence interval** is a range of values, computed from sample data, that is expected to contain the true value of a parameter (for example the **mean**) with a stated confidence level such as 95%. **Confidence intervals** measure the degree of uncertainty or certainty in a sampling method.
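As a rough illustration, the sketch below computes a 95% confidence interval for a sample mean. The data are hypothetical, and the interval uses a normal approximation (z about 1.96); for a sample this small a t-distribution would normally be preferred.

```python
import math
import statistics

# Hypothetical sample of measurements, for illustration only.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(len(sample))

# z-value for a two-sided 95% interval under a normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)

ci_low, ci_high = mean - z * std_err, mean + z * std_err
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```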

No. It is not a metric measure as it is not symmetric.

In probability theory, a **log-normal (or lognormal) distribution** is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable *X* is log-normally distributed, then *Y* = ln(*X*) has a normal distribution.
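A quick way to check this definition empirically is to draw log-normal samples and look at their logarithms. This is only a sketch: the parameters `mu` and `sigma` are arbitrary choices, and `random.lognormvariate` from the Python standard library is used as the sampler.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
mu, sigma = 1.0, 0.5     # hypothetical parameters of the underlying normal

x_samples = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
y_samples = [math.log(x) for x in x_samples]  # Y = ln(X)

y_mean = statistics.mean(y_samples)
y_std = statistics.stdev(y_samples)
print(y_mean, y_std)  # should land close to mu and sigma
```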

Spearman Rank Correlation Coefficient is determined by applying Pearson Coefficient on rank encoded random variables.
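The statement above can be sketched directly: rank both variables, then apply Pearson correlation to the ranks. The helper names (`ranks`, `pearson`, `spearman`) are made up for this example, and ties are handled with average ranks, which is one common convention.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson applied to the rank-encoded variables."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic in x, but not linear

print(pearson(x, y), spearman(x, y))
```

Because `y` is a monotonic function of `x`, the Spearman coefficient is 1 even though the Pearson coefficient on the raw values is below 1.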

Naive Bayes assumes that the features of a dataset are conditionally independent given the class, an assumption that rarely holds in practice. The conditional independence assumption is made to simplify the computation of the conditional probabilities. Naive Bayes is "naive" because of this assumption.

This happens when the datapoints are distributed in a region on a high-dimensional manifold around i, and we try to model the pairwise distances from i to the datapoints in a two-dimensional map. For example, it is possible to have 11 datapoints that are mutually equidistant in a ten-dimensional manifold but it is not possible to model this faithfully in a two-dimensional map. Therefore, if the small distances can be modeled accurately in a map, most of the moderately distant datapoints will be too far away in the two-dimensional map.

**PCA** should be **used** mainly for variables which are strongly correlated.

If the relationship is weak between variables, **PCA** does **not work** well to reduce data. Refer to the correlation matrix to determine.

**PCA** Results Are Difficult To Interpret Clearly.

When the query point is an outlier, or when the data is extremely random and carries no information.

- Linear relationship
- Multivariate normality
- No or little multicollinearity
- No auto-correlation
- Homoscedasticity

The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
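The underflow is easy to reproduce. In the sketch below the per-feature likelihoods are hypothetical values chosen so that their product falls below the smallest representable double, while the sum of their logs stays perfectly usable.

```python
import math

# Hypothetical per-feature likelihoods P(x_i | class), all small.
probs = [1e-5] * 100

naive_product = 1.0
for p in probs:
    naive_product *= p  # true value is 1e-500, too small for a 64-bit float

# Summing logs keeps the score in a comfortable numeric range.
log_score = sum(math.log(p) for p in probs)

print(naive_product)  # 0.0
print(log_score)      # about -1151.29
```

Since the logarithm is monotonic, comparing log scores across classes picks the same class as comparing the raw products would, without the underflow.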

Numerical features are assumed to be Gaussian. Probabilities are determined by considering the distribution of the data points belonging to different classes separately.

**Naive Bayes** classifiers don't offer an intrinsic method to evaluate **feature** importances. **Naive Bayes** methods work by determining the conditional and unconditional probabilities associated with the **features** and predicting the class with the highest probability.

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function.

In SGD only one data point is used per iteration to compute the loss (and its gradient), whereas in GD all the data points are used.
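The difference shows up in how often the parameter is updated. Below is a minimal sketch on a toy one-parameter regression problem; the data, the learning rate, and the function names are all assumptions made for this example.

```python
import random

# Toy, noise-free data from y = 3x (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

def grad_single(w, x, y):
    """Gradient of the squared error (w*x - y)^2 with respect to w for one point."""
    return 2.0 * (w * x - y) * x

def gradient_descent(w, lr, epochs):
    """GD: one update per pass, using the gradient averaged over all points."""
    for _ in range(epochs):
        g = sum(grad_single(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * g
    return w

def stochastic_gradient_descent(w, lr, epochs, seed=0):
    """SGD: one update per data point, visiting points in shuffled order."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            w -= lr * grad_single(w, x, y)
    return w

w_gd = gradient_descent(0.0, lr=0.05, epochs=200)
w_sgd = stochastic_gradient_descent(0.0, lr=0.05, epochs=200)
print(w_gd, w_sgd)  # both approach 3.0 on this noise-free data
```

On noisy data SGD would keep jittering around the minimum unless the learning rate is decayed, which is the usual trade-off for its much cheaper updates.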

Train time complexity O(n^{2})

Run time complexity O(k*d)

k=number of support vectors, d=dimensionality of data set

They are not that similar, but they are related. The point is that both kNN and RBF kernels are non-parametric methods for estimating the probability density of your data.

Notice that these two algorithms approach the same problem differently: kernel methods fix the size of the neighborhood (h) and then calculate K, whereas kNN fixes the number of points, K, and then determines the region in space which contains those points.

The max_depth parameter controls overfitting and underfitting in decision trees: a very large depth tends to overfit, while a very small depth tends to underfit.

Decomposing a matrix into two smaller matrices whose elements are all non-negative and whose product approximates the original matrix.

The **Netflix Prize** was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the **contest**.

Word embeddings are a type of word representation IN A VECTOR SPACE that allows words with similar meaning to have a similar representation.

Understanding word embeddings, semantics, and Word2vec model methods like skip-gram and CBOW.

Remember, machines can only operate on numbers. Approaches like bag of words and TF-IDF provide a way of converting text to numbers, but not without many drawbacks. Here are a few:

- As the dimensionality (vocab size) increases, it becomes memory inefficient.
- Since all the word representations are orthogonal, there is no semantic relation between words. That is, the amount of separation between words like "apple" and "orange" is the same as between "apple" and "door".

Therefore the semantic relation between words, which is really valuable information, is lost. We must come up with techniques that, along with converting text to numbers, are space efficient and also preserve the semantic relations among words.
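The orthogonality point can be checked with cosine similarity. In this sketch the one-hot vectors are exact, but the dense vectors are hand-picked numbers for illustration only, not the output of any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

one_hot = {"apple": [1, 0, 0], "orange": [0, 1, 0], "door": [0, 0, 1]}

# Hypothetical dense embeddings, chosen by hand to mimic "fruit vs. non-fruit".
dense = {"apple": [0.9, 0.1], "orange": [0.8, 0.2], "door": [-0.5, 0.7]}

# One-hot: every pair of distinct words is equally (un)related.
print(cosine(one_hot["apple"], one_hot["orange"]), cosine(one_hot["apple"], one_hot["door"]))

# Dense: similar words can sit closer together than unrelated ones.
print(cosine(dense["apple"], dense["orange"]), cosine(dense["apple"], dense["door"]))
```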

EMBEDDINGS ARE USED IN MANY DEEP LEARNING NLP TASKS LIKE SEQ2SEQ MODELS, TRANSFORMERS etc.

Before understanding how word2vec does the job discussed above, let's see what it is actually composed of.

Word2vec is a two-layer neural network that generates word embeddings of a chosen dimension from a given text corpus.

So essentially word2vec helps us create a numerical mapping of words into a vector space. Remember, the dimensionality of this space can be much smaller than the vocab size.

So what does the word2vec model try to do?

It tries to produce similar embeddings for words that occur in similar contexts.

It tries to achieve this using one of two algorithms:

- CBOW (continuous bag of words)
- Skip-gram

CBOW :A MODEL THAT TRIES TO PREDICT THE TARGET WORD FROM THE GIVEN CONTEXT WORDS

SKIPGRAM: A MODEL THAT TRIES TO PREDICT THE CONTEXT WORDS FROM THE GIVEN TARGET WORD

Let's break it down. Suppose we define "context" to be a window of m words. We iterate over all such windows from the beginning of our sentence and try to maximize the conditional probability of a target word occurring given its context words, or vice versa.
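The windowing step can be sketched as follows. This is only an illustration of how training examples might be generated: the function name `training_pairs` is made up, and `window` here means the number of words taken on each side of the target, which is one of several possible conventions.

```python
def training_pairs(tokens, window):
    """Yield (target, context_words) for every position in the sentence.

    `window` is the number of words taken on each side of the target.
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

sentence = "the quick brown fox jumps".split()

# CBOW: the context words are the input, the target word is the label.
cbow_examples = [(ctx, tgt) for tgt, ctx in training_pairs(sentence, 1)]

# Skip-gram: the target word is the input, each context word is a label.
skipgram_examples = [(tgt, c) for tgt, ctx in training_pairs(sentence, 1) for c in ctx]

print(cbow_examples[1])       # (['the', 'brown'], 'quick')
print(skipgram_examples[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```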

- LETS SAY THAT YOUR VOCAB SIZE IS V .
- WE PERFORM ONE-HOT ENCODING FOR ALL WORDS .
- WE CHOOSE OUR WINDOW SIZE AS C.
- WE DECIDE THE EMBEDDING DIMENSION, LET'S SAY N. OUR HIDDEN LAYER WILL HAVE N NEURONS.
- THE OUTPUT IS A SOFTMAX LAYER OF DIMENSION V.
- WITH THESE THINGS WE CREATE THE BELOW ARCHITECTURE.

WE PERFORM THE TRAINING AND UPDATE THE WEIGHTS. AFTER TRAINING, THE WEIGHT MATRIX BETWEEN THE HIDDEN LAYER AND THE OUTPUT SOFTMAX IS OUR WORD EMBEDDING MATRIX. NOTE THAT ITS DIMENSION IS (OUR SELECTED LOWER DIMENSION N) x (VOCAB SIZE V). HENCE IT CONTAINS THE N-DIMENSIONAL VECTOR SPACE REPRESENTATION OF ALL THE WORDS IN THE VOCABULARY.

THE WORKING OF SKIP-GRAM IS JUST THE OPPOSITE. THERE WE USE ONE WORD TO PREDICT THE CONTEXT WORDS.

NOW THE FINAL STEP: TO OBTAIN THE WORD VECTOR OF ANY WORD, WE TAKE THE EMBEDDING MATRIX THAT WE TRAINED AND MULTIPLY IT BY THE WORD'S ONE-HOT ENCODING.
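Multiplying an N x V matrix by a one-hot vector simply selects one column, which the sketch below demonstrates. The matrix values and the three-word vocabulary are hypothetical stand-ins for a trained embedding matrix.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

vocab = ["apple", "orange", "door"]

# Hypothetical "trained" N x V embedding matrix with N=2 and V=3.
# Column j holds the embedding of vocab[j].
E = [[0.9, 0.8, -0.5],
     [0.1, 0.2, 0.7]]

idx = vocab.index("orange")
vec = matvec(E, one_hot(idx, len(vocab)))  # full matrix-vector product
column = [row[idx] for row in E]           # direct column lookup

print(vec, column)  # both are [0.8, 0.2]
```

This is why practical implementations skip the multiplication and do a plain table lookup by word index.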

THATS IT!

ITS ALWAYS HELPFUL TO REMEMBER THE LOSS FUNCTION MENTIONED ABOVE IN INTERVIEWS!

HOW BATCH NORMALISATION SPEEDS UP TRAINING, HELPS IN SCALING AND ALSO ACTS AS A REGULARIZER IN NEURAL NETWORKS

We know that normalization and feature scaling help in achieving faster training and convergence. But why is that the case? Normalization makes our cost function more symmetric in all variables. Hence we do not have to worry about a given learning rate being too small in certain directions and too large in others.

Hence we can use sufficiently large learning rates to converge faster. THERE ARE ADAPTIVE LEARNING RATE OPTIMIZERS LIKE ADAM, BUT NORMALIZATION STILL HELPS. Also, we know that in neural networks data points are passed in batches.

The idea behind batch normalization is to normalize all the activations in every layer w.r.t the current batch data .

- SPEEDS UP TRAINING.
- REDUCES THE IMPORTANCE OF WEIGHT INITIALISATION (BECAUSE THE COST FUNCTION IS SMOOTHED OUT)
- CAN BE THOUGHT OF AS REGULARIZERS

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems. (Wikipedia)

NOTICE HOW, WHEN NORMALIZING THE ACTIVATIONS, THE MEAN AND SIGMA VARY WITH EACH BATCH THAT PASSES. THIS CREATES SOME "RANDOMNESS", AS EACH NEW UNSEEN BATCH WILL HAVE DIFFERENT MEAN AND SIGMA VALUES. THIS CAN BE THOUGHT OF AS REGULARIZATION, AS IT HELPS THE MODEL NOT TO OVERFIT TO A CERTAIN DISTRIBUTION.

REMEMBER THAT BATCH NORMALIZATION IS OFTEN USED TOGETHER WITH DROPOUT.

Since the mean and variance can vary greatly from batch to batch, there needs to be some calibration of these two quantities. Hence we introduce two learnable parameters for the purpose.

THE NUMBER OF TUNABLE PARAMETERS IN BATCH NORMALIZATION IS 2 (PER NORMALIZED FEATURE), NAMELY gamma AND beta. (gamma rescales the normalized activation, i.e. calibrates its variance; beta shifts it, i.e. calibrates its mean.)

How does it work differently during training and testing?
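Here is a minimal sketch of the training-time forward pass for a single feature, assuming the standard formulation in which gamma scales and beta shifts the normalized activations. The function name `batch_norm` and the sample activations are made up for this example; real layers also track running statistics for use at test time, which is omitted here.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps guards against division by zero when the batch variance is tiny.
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one neuron
out = batch_norm(activations, gamma=2.0, beta=0.5)

out_mean = sum(out) / len(out)
out_std = (sum((o - out_mean) ** 2 for o in out) / len(out)) ** 0.5
print(out_mean, out_std)  # mean close to beta, standard deviation close to gamma
```

After the transform, the output mean is set by beta and the output spread by gamma, so the network can learn whatever scale and offset suits the next layer.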

Following is the answer from the official documentation page from keras :