The post Kolmogorov–Smirnov test, comparing distributions appeared first on 7 HIDDEN LAYERS.


The post Weight initialization techniques visual demo appeared first on 7 HIDDEN LAYERS.

Of course, these weights are what a network learns during its training phase. But in order to update these numbers eventually, we must start somewhere. So the question is: how do we initialize the weights of a network?

There are various ways in which one can do so, for example:

- Symmetrically assigning all weights
- Randomly assigning all weights
- Assigning weights by sampling from a distribution

Let's look at each of the above cases and finally conclude with some techniques that are widely used to initialize weights in deep networks.

So, the first case is **W^{k}_{ij} = c for all i, j, k**, that is, the weights of all layers, between any two nodes, are the same and have the value c. Since the value is the same for all neurons, all the neurons are symmetric (each neuron along with all its subsequent connections) and will receive the same updates. We want each neuron to learn a distinct feature, and this initialization technique won't let that happen.

To demonstrate the above, see the visualization below:

- using TensorFlow Playground, initialize all weights to zero
- let the model train
- notice how all the weights originating from a neuron are the same
- you will see that neurons in the first hidden layer can have different weights for their incoming connections (since the inputs differ), but between any subsequent layers L-1 and L all values are the same, as the neurons are symmetric
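The symmetry argument can also be checked numerically. Below is a minimal NumPy sketch (my own illustration, not from the original post) of one gradient step on a tiny tanh network with constant initialization; every hidden unit receives an identical gradient, so the units can never differentiate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features
y = rng.normal(size=(8, 1))

# Constant initialization: every weight in a layer gets the same value c
c = 0.5
W1 = np.full((3, 4), c)       # input -> hidden
W2 = np.full((4, 1), c)       # hidden -> output

# Forward pass with a tanh hidden layer, mean-squared-error loss
h = np.tanh(X @ W1)
err = h @ W2 - y

# Backward pass
dW2 = h.T @ err / len(X)
dh = err @ W2.T * (1 - h ** 2)
dW1 = X.T @ dh / len(X)

# Every hidden unit computed the same activation, so every gradient column
# is identical: the units stay symmetric after the update
print(np.allclose(dW1, dW1[:, [0]]))  # True
print(np.allclose(dW2, dW2[0]))       # True
```

Since every unit gets the same update, the layers remain symmetric forever, exactly as the Playground demo shows.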

Of course, assigning random values removes the symmetry problem, but random weight assignment has one drawback: absurdly large or small weight values lead to vanishing gradients, since at extreme inputs a sigmoid's derivative is extremely close to zero, which hinders weight updates in each iteration.

Let's try to visualize the same in TensorFlow Playground:

- assign some large positive weights and some large negative weights
- notice how training seems to halt even after numerous epochs
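The vanishing-gradient claim is easy to verify: the sigmoid's derivative is σ'(z) = σ(z)(1 − σ(z)), which collapses toward zero for large |z|. A small illustrative sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1 - s)

# Large weights push pre-activations into the saturated regions of the
# sigmoid, where the gradient is nearly zero
for z in [0.0, 2.0, 10.0, 30.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.2e}")
```

At z = 0 the derivative is at its maximum of 0.25; by z = 30 it is on the order of 10⁻¹⁴, so the corresponding weight updates all but vanish.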

The need for better methods led to the development of activation-function-specific weight initialization techniques. Xavier initialization is used for tanh activations, and its logic is as follows:

Xavier initialization tries to initialize the weights of the network so that the neuron activation functions are not starting out in saturated or dead regions. We want to initialize the weights with random values which are not “too small or large.”

NOTE: there are two variations of Xavier initialization, based on the sampling technique used (uniform or normal).

**where, for any neuron, input = number of incoming connections and output = number of outgoing connections**

But why does this work?

An interviewer can pose numerous questions as to how this weird-looking formula came into existence. The answer lies in the following explanation.

- Xavier initialization tries to ensure that the dot product **w·x** that is fed into tanh is neither too large nor too small.
- To restrict the overall quantity, we can always control the x values (normalization also ensures that the data has zero mean and unit variance).
- Now, all that Xavier initialization does is ensure that the variance of the net input to each unit j, Var(net_j) = Var(**w·x**), is 1.

In the formula above, n is the number of incoming connections; the research paper, in order to also deal with outgoing connections, takes the number of outgoing connections into account as well, hence the final formula.
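As a sketch of the two variants (assuming the standard Glorot & Bengio formulas: uniform on ±√(6/(n_in+n_out)), or normal with variance 2/(n_in+n_out)):

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_uniform(n_in, n_out):
    # Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def xavier_normal(n_in, n_out):
    # Glorot/Xavier normal: N(0, 2 / (n_in + n_out))
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = xavier_normal(512, 256)
print(W.var())  # close to 2 / (512 + 256)
```

In Keras these correspond to `tf.keras.initializers.GlorotUniform` and `GlorotNormal`.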

In He initialization (used for ReLU activations), we multiply the random initialization by the term below (size^{[l-1]} is the same as the number of incoming connections).

As with Xavier initialization, He initialization also comes in two variants, namely He normal and He uniform.
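A minimal sketch of the two He variants, assuming the standard formulas (variance 2/n_in for the normal variant, limit √(6/n_in) for the uniform variant):

```python
import numpy as np

rng = np.random.default_rng(7)

def he_normal(n_in, n_out):
    # He normal for ReLU: N(0, 2 / n_in)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def he_uniform(n_in, n_out):
    # He uniform for ReLU: U(-limit, limit), limit = sqrt(6 / n_in)
    limit = np.sqrt(6.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = he_normal(1024, 512)
print(W.var())  # close to 2 / 1024
```

The factor 2 (instead of Xavier's 2/(n_in+n_out)) compensates for ReLU zeroing out roughly half of the activations.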

Note that "Kaiming initialization" is just another name for He initialization (after Kaiming He); a few other techniques, such as LeCun initialization, also exist, although I have rarely seen questions on them.

More to come! Meanwhile, you can visit https://www.tensorflow.org/api_docs/python/tf/keras/initializers/


The post PCA for interviews-Eigenvectors/Lagrangian appeared first on 7 HIDDEN LAYERS.

One of the most important mathematical concepts required to understand the working of PCA is that of eigenvectors and eigenvalues. If you know the algorithm, you must be aware that in order to reduce a dataset of dimension d to a smaller dimension k, PCA works by calculating the eigenvectors and eigenvalues of the covariance matrix of the dataset and selecting the eigenvectors corresponding to the top k eigenvalues.

But why? Why do eigenvalues and eigenvectors get involved in the reduction of dimensions?

For any square matrix A, its eigenvectors v and eigenvalues λ satisfy the relation Av = λv.

Geometrically speaking, one can see that, for a given eigenvector-eigenvalue pair, the matrix multiplication amounts to a mere scaling of the eigenvector by the factor λ. Hence its direction is unchanged.

**Eigenvalues** and **eigenvectors** are **only for square matrices**. **Eigenvectors** are by definition nonzero. **Eigenvalues** may be equal to zero.
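The relation Av = λv can be verified directly with NumPy (a small illustrative example, not from the original post):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh: eigendecomposition for symmetric matrices
# (covariance matrices, as used in PCA, are always symmetric)
eigvals, eigvecs = np.linalg.eigh(A)

# Check A v = lambda v for each pair: multiplying by A only scales the
# eigenvector, leaving its direction unchanged
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))  # True
```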

Now let's look at another concept used in PCA.

Sometimes, when one needs to find the maxima/minima of an expression that is subject to a certain constraint, you cannot simply take its first derivative and equate it to zero. You must take care of the given constraint.

Such constrained optimisation problems are solved using the concept of LAGRANGE MULTIPLIERS.

Note: for inequality constraints (as in the KKT conditions), Lagrange multipliers are non-negative; for equality constraints, like the one in PCA, they can take any sign.

Now let's see how the two concepts, eigenvectors/eigenvalues and Lagrange multipliers, combine to solve the PCA optimisation problem. Recall that PCA is all about finding axes of maximum variance. The optimisation problem of PCA is to maximize u^T S u subject to u^T u = 1, where S is the covariance matrix of the dataset, hence a d×d square matrix (d being the original dimension of the dataset), and u is the direction along which the variance is to be maximized.

You can see that the above is just a constrained optimisation problem. Using the Lagrange multiplier lambda in the Lagrangian **L(u, lambda)** and equating **dL(u, lambda)/du** and **dL(u, lambda)/d(lambda)** to zero, you can see that solving gives a solution of the form

So now you see how the solutions u are just eigenvectors of S (and the lambdas the corresponding eigenvalues). Eigenvalues in decreasing order give directions of decreasing variance; hence, to reduce d dimensions to k, we select the eigenvectors corresponding to the top k eigenvalues.
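The whole pipeline (covariance matrix, eigendecomposition, top-k projection) fits in a few lines of NumPy; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: most of the variance lies along one direction
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)                 # centre the data

S = np.cov(X, rowvar=False)            # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues for symmetric S

k = 1                                              # target dimension
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # top-k principal axes
X_reduced = X @ top                                # project onto k dimensions
print(X_reduced.shape)  # (500, 1)
```

The eigenvector with the largest eigenvalue is exactly the direction u that maximizes u^T S u under u^T u = 1.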

This post deals with the mathematics behind the optimisation of the PCA algorithm. Other important interview questions relate to its limitations, comparison to t-SNE, and more!

Till then enjoy this


The post GloVe Model appeared first on 7 HIDDEN LAYERS.

Quoted from the GloVe research paper:

"The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning."

Recall the word2vec model. The objective of w2v (say, CBOW) was to maximize the (log-)likelihood of the main word given the context words. Here the context words were decided by a "window", and only local information was used. This was a point raised in the GloVe research paper: "global" in the name GloVe refers to utilising word co-occurrence information from the entire corpus.

Another point raised in the paper concerned the computational cost of using softmax in the w2v model. The paper shows how the GloVe model helps handle this.

Let's start by understanding what a co-occurrence matrix is:

Suppose we have 3 sentences:

- I enjoy flying.
- I like NLP.
- I like deep learning.

Considering one word to the left and one word to the right as context, we simply count the number of occurrences of other words within this context. Using the above sentences, we have 8 distinct vocabulary elements (including the full stop), and this is how the co-occurrence matrix X looks:
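The matrix itself was shown as an image; here is a small sketch of how it can be computed for these three sentences (window of one word on each side; whitespace tokenization is an assumption for illustration):

```python
from collections import defaultdict

sentences = [
    "I enjoy flying .",
    "I like NLP .",
    "I like deep learning .",
]

# Count co-occurrences within a window of 1 word to the left and right
counts = defaultdict(int)
vocab = set()
for s in sentences:
    tokens = s.split()
    vocab.update(tokens)
    for i, w in enumerate(tokens):
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens):
                counts[(w, tokens[j])] += 1

vocab = sorted(vocab)
X = [[counts[(wi, wj)] for wj in vocab] for wi in vocab]
print(vocab)  # 8 distinct tokens, including the full stop
print(X)
```

Note that with a symmetric window the resulting matrix X is symmetric.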

First we establish some notation. Let the matrix of word-word co-occurrence counts be denoted by X, whose entries X_{ij} tabulate the number of times word j occurs in the context of word i. Let X_{i} = Σ_k X_{ik} be the number of times any word appears in the context of word i. Finally, let P_{ij} = P(j|i) = X_{ij}/X_{i} be the probability that word j appears in the context of word i.

Using the above definitions, the paper demonstrates how the probability of finding a certain word in context, given a certain word, is found. The following points can be noted:

- notice how all the individual probabilities are of the order 10^(-4) or lower; this can be attributed to the fact that the corpus is huge and the word-context co-occurrence matrix is sparse
- it is difficult to interpret such differences; as you can see, there is no clear pattern and all the numbers are small
- the solution the researchers gave was to instead use a ratio; as you can see from the third row of the table, it makes things interpretable
- if the word is more probable in the context of "ice", the ratio will be a large number
- if the word is more probable in the context of "steam", the ratio will be a small number
- if the word is out of context for both "ice" and "steam", the value will approach one

Notice how the ratio carries significant information in a large corpus.

The function F can be any function that solves our problem (the w's are the weight vectors of the 3 words). Two further points can be noted:

- the RHS is a scalar quantity while the LHS operates on vectors; hence we can use a dot product
- in vector spaces, it is the difference between vectors that carries information; hence we can write F as:

But there is one more point!

Any word keeps interchanging its role between "word" and "context word". The final model should be invariant under this relabelling, so we need to take care to maintain such symmetry.

To do so consistently, we must not only exchange w ↔ w̃ but also X ↔ X^{T}.

The research paper puts it as follows:

"Next, we note that Eqn. (6) would exhibit the exchange symmetry if not for the log(X_{i}) on the right-hand side. However, this term is independent of k so it can be absorbed into a bias b_{i} for w_{i}. Finally, adding an additional bias b̃_{k} for w̃_{k} restores the symmetry."

One major drawback at this stage: suppose you have a context window of 10 words to the left and 10 to the right. Since there is no weighting factor involved, the 10th word is as important as the word occurring right beside. This must be taken care of, so the paper introduces a weighting function f with the following properties:

- f should be continuous
- f(0)=0
- f(x) should be non-decreasing, so that rare co-occurrences are not over-weighted
- f(x) should not explode or be too large for large values of x, so that frequent co-occurrences are not over-weighted
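The weighting function actually chosen in the GloVe paper, f(x) = (x/x_max)^α for x < x_max and 1 otherwise (with x_max = 100 and α = 3/4 in the paper), satisfies all of these conditions:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    # GloVe weighting: f(x) = (x / x_max)^alpha if x < x_max, else 1
    return (x / x_max) ** alpha if x < x_max else 1.0

# f(0) = 0, f is non-decreasing, and f is capped at 1 for frequent pairs
for x in [0, 1, 10, 100, 1000]:
    print(x, glove_weight(x))
```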

The research paper shows how it performs against the other well-known algorithm, w2v.

The GloVe model has sublinear complexity, O(|C|^{0.8}), where C is the corpus size. Of course, this is much better than the worst-case complexity O(|V|^{2}).

Once you have read this blog, it should be easier to read the research paper:

https://www.aclweb.org/anthology/D14-1162.pdf

Enjoy! here is something out of “context” :p


The post TIME COMPLEXITIES OF ML ALGORITHMS appeared first on 7 HIDDEN LAYERS.

TIME COMPLEXITIES ARE IMPORTANT TO REMEMBER IN RELATION TO THE FOLLOWING INTERVIEW QUESTIONS:

- WHEN WOULD YOU CHOOSE ONE MODEL OVER THE OTHER?
- IS A PARTICULAR ALGORITHM GOOD FOR INTERNET APPLICATIONS?
- DOES AN ALGORITHM SCALE WELL ON LARGE DATASETS/DATABASES?
- IS RE-TRAINING A PARTICULAR MODEL COSTLY IN TERMS OF RESOURCES?
- IF THE INPUT DISTRIBUTION IS CHANGING CONSTANTLY, WHICH MODEL IS PREFERABLE?

LET'S START OUR DISCUSSION ON TIME COMPLEXITIES. I WILL ASSUME YOU KNOW THE BASIC INTUITION OF ALL THE ALGORITHMS, SO THIS WILL BE A CRISP, TO-THE-POINT SUMMARY!

REMEMBER: YOU MIGHT OFTEN FIND DIFFERENT ANSWERS ON THE INTERNET FOR THE SAME ALGORITHM, BECAUSE OF THE UNDERLYING DATA STRUCTURE OR CODE OPTIMISATION USED, OR BECAUSE THEY STATE BEST/WORST CASES. THE BEST APPROACH IS TO KNOW ONE COMPLEXITY AND THE CORRESPONDING PROCEDURE USED.

TERMINOLOGY :

DATASET = {X,Y} DIMENSIONS OF X= n*m , DIMENSIONS OF Y = n*1

train-time complexity = O((m^2)·(n+m)); run-time complexity = O(m); space complexity (during run time) = O(m). Conclusion: small run-time and space complexity, good for low-latency problems. Source: https://levelup.gitconnected.com/train-test-complexity-and-space-complexity-of-linear-regression-26b604dcdfa3

TERMINOLOGY:

DATASET = {X,Y} DIMENSIONS OF X= n*m , DIMENSIONS OF Y = n*1

train-time complexity = O(n·m); run-time complexity = O(m); space complexity (during run time) = O(m). Conclusion: small run-time and space complexity, good for low-latency problems.

DATASET = {X,Y} DIMENSIONS OF X= n*d , DIMENSIONS OF Y = n*1 CLASSES =C

train-time complexity = O(n·d·c); run-time complexity = O(d·c); space complexity (during run time) = O(d·c). Conclusion: good for high-dimensional datasets; used in spam detection and other simple non-semantic NLP problems; considered a baseline model for comparison with complex models.

DATASET = {X,Y} DIMENSIONS OF X= n*d , DIMENSIONS OF Y = n*1 S=NUMBER OF SUPPORT VECTORS

train-time complexity = O(n^2); run-time complexity = O(S·d); space complexity (during run time) = O(S). Conclusion: high train-time complexity, but low latency and space complexity.

n=NUMBER OF SAMPLE DATA POINTS ,

d = THE DIMENSIONALITY

k =depth of decision tree

p = number of nodes

train-time complexity = O(n·log(n)·d); run-time complexity = O(k); space complexity (during run time) = O(p)

NOW IF WE ARE TALKING OF A RANDOM FOREST (M BASE LEARNERS): train-time complexity = O(n·log(n)·d·m); run-time complexity = O(k·m); space complexity (during run time) = O(p·m)

TALKING OF GBDT (m successive models, b is the shrinkage factor; the m such coefficients ultimately add only a constant to the complexity and can be ignored asymptotically, but it is better to point this out during interviews, as it is what separates GBDT from RF): train-time complexity = O(n·log(n)·d·m); run-time complexity = O(k·m); space complexity (during run time) = O(p·m + b·m)

K= number of nearest neighbours

d= dimensions

(no training phase)

run-time complexity = O(n·d) (the "+k·d" term is ignored asymptotically); space complexity (during run time) = O(n·d). See https://stats.stackexchange.com/questions/219655/k-nn-computational-complexity

K= number of nearest neighbours

d= dimensions

An algorithm that builds a balanced k-d tree to sort points has a worst-case complexity of O(k·n·log n).

run-time complexity (best case) = O(2^d · k · log n) (log n to find cells "near" the query point, 2^d to search the cells in that neighborhood); run-time complexity (worst case) = O(k·n); space complexity = O(k·n)

As you can see, not suitable for very high dimensions.

| NAME OF ALGO | TRAIN COMPL. | SPACE COMPLEXITY | REMARK |
| --- | --- | --- | --- |
| K-MEANS | O(n·k·d·i), n = points, d = dimensions, k = number of clusters, i = iterations | O(n·d + k·d) ("k·d" for the centroids) | asymptotically, for small d and k: O(n·d) |
| HIERARCHICAL CLUSTERING | O(n^3) | O(n^2) | not suitable for low-latency and low-space problems |
| DBSCAN | O(n^2), but can be made O(n·log n) in lower dimensions using efficient data structures | O(n) {d ≪ n} | better than hierarchical (in terms of complexity) |

NOW THIS IS A TRICKY ONE; VARIOUS METHODS ARE AVAILABLE, AND ON THE INTERNET YOU WILL GENERALLY FIND MULTIPLE ANSWERS TO THIS QUESTION.

BELOW I STATE WHAT WIKIPEDIA STATES (IF YOU GIVE ANY OTHER ANSWER IN AN INTERVIEW, JUST BE SURE TO UNDERSTAND THE CORRESPONDING ALGORITHM USED).

FOR THE TIME COMPLEXITIES OF THE UNDERLYING MATRIX OPERATIONS, SEE: https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations

N = NUMBER OF POINTS, D = DIMENSIONALITY

TRAIN-TIME COMPLEXITY (PCA) = O(N·D·min(N, D) + D³)

EXPLANATION:

{The complexity of **covariance matrix computation** is O(N·D²). Its eigenvalue decomposition is O(D³).}

NOTICE THAT AFTER THIS WE JUST SELECT THE FIRST d (TARGET DIMENSION) EIGENVECTORS.

The main reason t-SNE is slower than PCA is that no analytical solution exists for the criterion being optimised. Instead, a solution must be approximated through gradient descent iterations.

t-SNE has a quadratic time and space complexity in the number of data points.

t-SNE requires O(3N²) of memory.

https://www.geeksforgeeks.org/ml-t-distributed-stochastic-neighbor-embedding-t-sne-algorithm/


The post Gaussian NAIVE BAYES, continuous features appeared first on 7 HIDDEN LAYERS.

Here we discuss one of the approaches used for handling continuous variables in Naive Bayes.

Suppose we have the following dataset, where the target variable is whether a movie will be a hit or not, and the feature variables are the action rating and the story rating (numbers between 1 and 10):

| ACTION RATING (AR) | STORY RATING (SR) | HIT/FLOP |
| --- | --- | --- |
| 7.2 | 5.8 | HIT |
| 3.4 | 6.3 | FLOP |
| 3.5 | 7.3 | FLOP |
| 8.5 | 8.0 | HIT |
| 6.9 | 2.8 | FLOP |
| 7.0 | 5.3 | HIT |
| 9.0 | 3.8 | HIT |

NOW LET'S SUPPOSE WE HAVE A TEST POINT: ACTION RATING = 7.6, STORY RATING = 5.7. These are what we need to predict:

LET'S START BY CONSIDERING THE FIRST PROBABILITY EXPRESSION.

BUT NO SUCH POINT IS PRESENT IN THE DATASET, SO SHOULD WE SET THIS PROBABILITY TO ZERO? AND SIMILARLY FOR THE SECOND EXPRESSION? THIS WOULD MEAN THAT ANY UNSEEN POINT WOULD ALWAYS LEAD TO BOTH PROBABILITIES TURNING TO ZERO. SO HOW DO WE RESOLVE THIS ISSUE? LET'S GET THERE.

There are 3 expressions that need to be evaluated in the expression below:

**P(HIT | AR=7.6, SR=5.7)** ∝ **P(AR=7.6 | HIT) · P(SR=5.7 | HIT) · P(HIT)**

The **P(HIT)** calculation is straightforward: *{total number of hits / total number of hits and flops}*.

For calculating the two remaining conditional probabilities, we assume that the values in the dataset are sampled from a Gaussian distribution, with mean and variance estimated from the sample points. To recall, this is what a Gaussian distribution looks like:

Now, once we have the Gaussian distribution for a feature column, we can get the pdf value for any point, whether or not it is present in our dataset.

IMPORTANT POINTS TO BE NOTED:

- While calculating **P(HIT | AR=7.6, SR=5.7), the Gaussian distributions will be estimated using only the data points where output = HIT.**
- A different distribution is estimated for every (column, target value) pair, so here there will be 4 distributions, whose data points come from *AR FOR HIT, AR FOR FLOP, SR FOR HIT, SR FOR FLOP*.
- You can always plot the dist-plot and see whether the distribution is Gaussian or not.
- Before applying Gaussian Naive Bayes you can use the Box-Cox transform to make the distribution more normal.
- If you see that columns deviate hugely from a Gaussian distribution, you can use different distributions; alternatives include the log-normal (Box-Cox with gamma = 0 gives the log transform) and power-law distributions. Below you see the general expression used in the Box-Cox transform; you can see how gamma = 0 turns it into a log transform.
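Putting the points above together, here is a minimal from-scratch sketch of Gaussian Naive Bayes on the movie table (one Gaussian per feature per class, fit by sample mean and variance; the helper names are my own, not from the post):

```python
import math

# Dataset from the table above: (AR, SR, label)
data = [
    (7.2, 5.8, "HIT"), (3.4, 6.3, "FLOP"), (3.5, 7.3, "FLOP"),
    (8.5, 8.0, "HIT"), (6.9, 2.8, "FLOP"), (7.0, 5.3, "HIT"),
    (9.0, 3.8, "HIT"),
]

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(points):
    # Sample mean and variance of one feature for one class
    mu = sum(points) / len(points)
    var = sum((p - mu) ** 2 for p in points) / len(points)
    return mu, var

def predict(ar, sr):
    scores = {}
    for label in ("HIT", "FLOP"):
        rows = [r for r in data if r[2] == label]
        prior = len(rows) / len(data)
        mu_ar, var_ar = fit([r[0] for r in rows])  # one Gaussian per feature per class
        mu_sr, var_sr = fit([r[1] for r in rows])
        scores[label] = (prior
                         * gaussian_pdf(ar, mu_ar, var_ar)
                         * gaussian_pdf(sr, mu_sr, var_sr))
    return max(scores, key=scores.get), scores

label, scores = predict(7.6, 5.7)
print(label)  # HIT
```

Because the pdf is defined everywhere, the unseen point (7.6, 5.7) gets a nonzero score for both classes; this is exactly how the zero-probability problem is resolved.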

With the above points in mind, you are ready to use Gaussian Naive Bayes! You can read more about the Box-Cox transform here:

More to come!


The post Splitting nodes in DT for continuous features(classification) appeared first on 7 HIDDEN LAYERS.

When the features are continuous, how does one split the nodes of a decision tree? I assume you are familiar with the concept of entropy.

Suppose we have a training dataset of n sample points. Let us consider one particular feature, f1, which is continuous in nature.

- We need to perform the splitting over all sample points.
- We sort the f1 column in ascending order.
- Then, taking every value in f1 as a threshold, we calculate the entropy and the information gain.
- We select the threshold with the highest information gain and make a split.
- We then continue doing the same for the leaf nodes, until either max_depth is reached or a node has fewer sample points than min_samples.

Let's try to understand the above with an example.

Let the following be the f1 feature column, and let's say it's a two-class classification problem:

| F1 (NUMERICAL FEATURE) | TARGET VARIABLE/LABEL |
| --- | --- |
| 5.4 | YES |
| 2.8 | NO |
| 3.9 | NO |
| 8.5 | YES |
| 7.6 | YES |
| 5.9 | YES |
| 6.8 | NO |

WE START BY SORTING THE FEATURE VALUES IN INCREASING ORDER:

| SORTED F1 | TARGET VARIABLE/LABEL |
| --- | --- |
| 2.8 | NO |
| 3.9 | NO |
| 5.4 | YES |
| 5.9 | YES |
| 6.8 | NO |
| 7.6 | YES |
| 8.5 | YES |

NOW WE CHOOSE EACH VALUE AS THE THRESHOLD, ONE BY ONE: 2.8, 3.9, and so on. Below we display the split for one point, say 5.4.

We perform similar splits for all the data points, and whichever gives the maximum IG becomes our first splitting point. If you cannot recall what IG is, this image might help:

Now, for further splits, a similar approach is repeated on the leaf nodes.
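The threshold-scanning procedure above can be sketched in a few lines; using the sorted f1 column from the table:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    # IG = H(parent) - weighted average entropy of the two child nodes
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

f1 = [2.8, 3.9, 5.4, 5.9, 6.8, 7.6, 8.5]   # already sorted
y  = ["NO", "NO", "YES", "YES", "NO", "YES", "YES"]

# Try every value as a threshold and keep the one with maximum IG
best = max(f1[:-1], key=lambda t: information_gain(f1, y, t))
print(best)  # 3.9
```

For this toy column, the best first split is at 3.9: everything below it is pure NO, and the remaining node is mostly YES.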

Alternatively, we could handle the problem by feature binning, converting the numerical features into categorical ones.

More to come!


The post KNN and Probability (interview questions) appeared first on 7 HIDDEN LAYERS.

KNN does not have a learning phase. It's a lazy algorithm that just finds the "k" nearest neighbours and performs the classification/regression task almost like a hard-coded instruction; nothing "intelligent" seems to happen. While the idea behind the algorithm is pretty simple and straightforward, it is this simplicity that leads to many possible questions, because when one tries to solve complex real-life problems using such simple algorithms, many border cases must be considered.

Let's try to answer a few.

We know KNN can be used to solve classification as well as regression. We start with problems faced during classification .

SUPPOSE YOU HAVE 4 CLASSES AND THE NUMBER OF NEAREST NEIGHBOURS (K) YOU CHOSE IS 30. FOR A CERTAIN POINT YOU GOT THE FOLLOWING RESULTS :

- CLASS 1 =10 NEIGHBOURS
- CLASS 2 =10 NEIGHBOURS
- CLASS 3= 6 NEIGHBOURS
- CLASS 4= 4 NEIGHBOURS

NOW WHAT SHOULD OUR TEST POINT BE CLASSIFIED AS? FIRST, LET'S CONSIDER WHAT THE NEIGHBOURS FROM THE 3RD AND 4TH CLASSES TELL US. CONSIDER A MEDICAL CASE, SAY CANCER DETECTION, WHICH CONSISTS OF 2 OUTPUT CLASSES, C1 AND C2.

YOU USED KNN WITH k=10 AND GOT 6 AND 4 POINTS RESPECTIVELY AS NEIGHBOURS. JUST BECAUSE YOU HAVE MORE NEIGHBOURS OF CLASS 1, CAN YOU RULE OUT THE POSSIBILITY OF THE SECOND CANCER TYPE BEING PRESENT?

THE ANSWER IS NO.

In many cases you need to look at the "probabilistic" results rather than just the final majority-vote selection. Considering the above case of 2 classes, this is how the results would differ:

- if we use simple majority-vote KNN, output: cancer of class 1
- if we use probability scores, output: 60% chance of cancer 1 and 40% chance of cancer 2

Now let's get back to the 4-class problem; of course, there the maximum neighbours are from class 1 and class 2. Even using probability scores, if we need to give one final output as the solution to a business problem, how can we break the tie? In such cases a lot depends on the domain of the problem, but let's discuss 2 generic ways to break such a tie.

From class 1 and class 2, we can select the final output by using weighted KNN. In weighted KNN we assign more weight to points that are closer to the test point; so instead of just counting neighbours, we can weight each neighbour's vote by the inverse of its distance.

We used k=30 in the case above. To break the tie, we can use k=32 or 34; recounting the neighbours may then remove the tie.
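A minimal sketch of the inverse-distance vote described above (the epsilon term is my own assumption, added to guard against zero distances):

```python
from collections import defaultdict

def weighted_knn_vote(neighbours):
    """neighbours: list of (distance, label). Votes are weighted by 1/distance."""
    scores = defaultdict(float)
    for dist, label in neighbours:
        scores[label] += 1.0 / (dist + 1e-9)  # epsilon guards against dist == 0
    return max(scores, key=scores.get), dict(scores)

# A tie in raw counts (2 vs 2), but the class "A" neighbours are closer
neighbours = [(0.5, "A"), (0.8, "A"), (2.0, "B"), (2.5, "B")]
label, scores = weighted_knn_vote(neighbours)
print(label)  # A: closer neighbours carry more weight
```

sklearn's `KNeighborsClassifier(weights="distance")` implements the same idea.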

In KNN regression, we return the average or median of the continuous target value associated with the nearest neighbours. The problem here is not related to probability; rather, it is the presence of outliers.

An outlier can mess up the average, but the median is more robust to such issues.

So, in summary, KNN works well only when one class clearly dominates the others among the k neighbours. And remember, because there is no such thing as "training" in KNN, one can do little except change the value of k if the neighbours are distributed evenly across the classes.

This article focused on one of the many problems one can face during interviews. Other problems and their solutions, like kmeans++, kd-trees, and more, will be discussed in subsequent posts.


The post The Fault In “GPT-3”. IS IT A HYPE? appeared first on 7 HIDDEN LAYERS.

If you are familiar with the concepts of attention and transformers, you must have come across the word "GPT", be it the earlier models like GPT-1 and GPT-2 or the recently released GPT-3.

GPTs (Generative Pre-trained Transformers) are decoder-only stacks developed by OpenAI.

Ever since GPT-3 was released, platforms like Twitter were flooded with posts that glorified the model and what it can do. The posts were written in a manner that would make any layperson perceive it as some sort of magic. Funny claims like "this is the end of software engineering" were made.

GPT-3 is in fact a milestone in NLP, as it showed performance like never before. But one needs to understand the limitations and the reasons for such performance. Finally, one can see that GPT-3 is far from deserving the label "near-human intelligence".

Below you can see the architecture of the GPT-1 model (one transformer decoder).

**Further enhancements by varying layers and parameters led to GPT-2.**

GPT-3 is structurally similar to GPT-2. The main advancements are the result of the extremely large number of parameters used in training the model. The computing resources used were also far beyond what any "normal" research group can afford.

| MODEL | NUMBER OF PARAMETERS | NUMBER OF LAYERS | BATCH SIZE |
| --- | --- | --- | --- |
| GPT-2 | 1.5 B | 48 | 512 |
| GPT-3 SMALL | 125 M | 12 | 0.5 M |
| GPT-3 MEDIUM | 350 M | 24 | 0.5 M |
| GPT-3 LARGE | 760 M | 24 | 0.5 M |
| GPT-3 6.7B | 6.7 B | 32 | 2 M |
| GPT-3 13B | 13.0 B | 40 | 2 M |
| GPT-3 175B OR "GPT-3" | 175.0 B | 96 | 3.2 M |

THE MAJORITY OF THE PERFORMANCE BENEFITS COME FROM THE ENORMOUS NUMBER OF PARAMETERS.

Well, if you are thinking of training a GPT-3 model from scratch, you might need to think twice. Even for OpenAI, the cost of training GPT-3 was close to **$4.6 million**. And at present computing costs, training a GPT-4 or GPT-8 might be too expensive even for such huge organizations.

Given that GPT-3 was trained on Common Crawl data from the internet, the model was prone to "learn" social bias against women and black people, along with the hate speech that is present in abundance on the internet. It's not surprising these days to find two people cussing and fighting on any social media platform. Sad.

GPT-3 fails at tasks that are very problem-specific. You can expect it to understand and answer common daily-life questions (even then, there is no guarantee of 100% accuracy), but it can't answer very specific medical-case questions. Also, there is no "fact-checking mechanism" to ensure that the output is not only semantically coherent but also factually correct.

Direct application of transformers to images isn't feasible, considering the dimensionality of an image and the training-time complexity of a transformer. Even for people/organizations with huge computation power, it is overwhelming.

A RECENTLY PUBLISHED PAPER, "AN IMAGE IS WORTH 16X16 WORDS", HAS SHOWN HOW TO USE TRANSFORMERS FOR CV TASKS. DO CHECK OUT THIS LINK:

At the moment, not everyone can get access to it. OpenAI wants to ensure that no one misuses it. This has certainly raised some questions in the AI community and is debatable.

YES!!! Any model to date is miles away from achieving general intelligence. Even the GPT-3 research team has clearly asked the media not to create a "FAKE BUZZ": even though this is a milestone for sure, it is not general intelligence and can make errors.

Given the access restrictions and the fact that you cannot train it (and even if you could, it would just be a library implementation like BERT), you are only expected to know the theoretical part if you mention it in your resume.

😛

LINK TO GPT-3 RESEARCH PAPER : https://arxiv.org/pdf/2005.14165.pdf


The post INTERVIEW GUIDE TO TSNE appeared first on 7 HIDDEN LAYERS.

We will break down the entire algorithm and try to cover all the details we can think of. If you find something missing, feel free to point it out in the comments section and I shall add it in later edits!

Let's start by looking at the timeline of various dimensionality reduction algorithms.

Observe how old PCA is; t-SNE is comparatively an extremely "young" algorithm. Although "BH-tSNE" is something you won't find in many online courses or their playlists, we will see briefly how it overcomes the shortcomings of t-SNE.

**REMEMBER:** tsne is just a visualization tool and is not used to transform data to train a model .

The T distribution, also known as Student's t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape, but has heavier tails. T distributions have a greater chance of extreme values than normal distributions, hence the fatter tails.

In mathematical statistics, the **Kullback–Leibler divergence,** (also called **relative entropy**), is a measure of how one probability distribution is different from a second, reference probability distribution.

Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.

Interview question: "Can KL divergence be considered a distance metric?"

Answer: NO. The function is non-symmetric in p and q; if it were used as a distance metric, the distance from p to q would not match the distance from q to p.
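A quick numerical check of that asymmetry (discrete distributions, natural log; a small illustrative example):

```python
import math

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); assumes strictly positive entries
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # KL(p || q)
print(kl_divergence(q, p))  # KL(q || p): a different number
```

The two directions give different values, so KL divergence violates the symmetry requirement of a metric (it also violates the triangle inequality).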

Before discussing how this cost function is optimized, let's ask 2 questions.

Let's answer the above 2 questions.

The difference between SNE and t-SNE is this replacement of the Gaussian distribution in the q_{ij} expression: SNE uses a Gaussian distribution in the lower dimension too, whereas t-SNE replaces it with Student's t-distribution.

The reason behind it is simple: the t-distribution helps solve the famous crowding problem. Read the definition of the t-distribution above and you will see how having broader tails allows dissimilar objects to be modelled far apart. Have a look:

The first thing to notice is that the cost function is non-convex. We can use the standard gradient approach to solve the optimization. Now, one of the more mathematical interview questions is to prove that the PCA and t-SNE (also KNN) optimization problems converge: FOR THAT, PLEASE REFER TO THIS.

This article was about the mathematical background and questions on t-SNE. FOR THE PARAMETERS IN THE T-SNE ALGORITHM, LIKE:

- PERPLEXITY
- EXAGGERATION
- OPTIMISATION PARAMETERS LIKE: learning rate, momentum (for gradient descent), gradient clipping

refer to this: https://opentsne.readthedocs.io/en/latest/parameters.html

LINK TO THE RESEARCH PAPER OF TSNE: https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf

And after you finish reading this blog, take a break and listen to this :p

