Hey all, this will be a series of videos where I will cover natural language processing through Transformers and other derivative architectures such as BERT and its variations, GPTs, and more.
This video is an introduction to natural language processing. It is part of a series that will cover NLP (theory and applications) from scratch.
In this post we discuss the basics of the cloud, artificial intelligence in general, and how Google Cloud AI provides services for both. Before getting into Google Cloud services in particular, let's try to understand the purpose of cloud computing and the resources it offers.
What is Cloud computing?
Over the last decade a lot of services have moved online: movies, songs, shopping, storage, and much more. Many businesses now provide online services, shops, and courses.
Traditionally, to launch your own online service, you would have had to buy and assemble hardware for computing, storage, and security. You would have had to build giant servers and data warehouses.
And if your business had to scale up, you would have had to set up new hardware according to demand.
Yet another problem would have been handling customer load, and your hardware location would have become a single point of failure (in case of any technical issue at that location, your entire service would go down).
Google Cloud, or cloud computing in general, tries to solve these problems. Imagine someone providing a service that allows you to focus on building and marketing your product, provides you with managed hardware resources, ensures data security, and lets you scale up and down according to the need of the hour.
To summarize, the features of cloud computing are:
| Cloud feature | Purpose |
| --- | --- |
| Maintenance | You can choose what level of administration and management you want to handle yourself and what the cloud service provider should handle. |
| Data security | Cloud services like Google Cloud and AWS are highly secure, so you can be assured of data security. |
| No single point of failure | The collapse of, or a temporary issue at, a particular data center won't bring your service down. |
| Storage services | You get huge storage capacities. |
| Pay as you go | Possibly one of the best features of the cloud: it assigns compute resources according to the traffic on your service, so you pay only for what you use. |
| Hosting services | It lets you host applications on the cloud. |
| Big data and machine learning (Cloud AI) | You can run big data clusters on the cloud, and you can also train and deploy machine learning and deep learning models there. |
Now that we have a basic understanding of what cloud computing is, let's have a brief discussion on AI.
What is AI?
AI models are software systems that can make human-like intelligent decisions or predictions when introduced to new or real-world data. Intelligent, learning-based software is required when rule-based software is not enough to solve a problem.
We won't dive into great detail regarding the mathematics and algorithms behind ML, DL, or AI in this post, but let's list a few cases where online businesses use AI:
Recommendation systems (Netflix, Amazon, Flipkart)
Language translation tasks
Sentiment analysis of tweets, comments, and product reviews
Now let's focus our attention on the Google Cloud AI services in particular.
GCP allows you to prepare your data, train models, and deploy models into production.
GCP's AI and machine learning products also help you select features according to your specific needs. Below are the GCP AI and ML products:
Build with AI – products like AutoML and AI Infrastructure
Conversational AI – APIs for speech-to-text and text-to-speech conversion
AI for documents – natural language, translation, and computer vision tasks
AI for industries – Media Translation, recommendation systems, Healthcare NLP
As we have seen, GCP allows you to use big data queries to fetch and preprocess your data. Below you can visualize the entire workflow of an AI project from start to deployment. Notice how one can build notebooks and pipelines entirely on the cloud itself.
In this video we try to understand what a GAN actually tries to achieve. GANs have become synonymous with "fake faces", which hides the actual mathematical problem statement; here we try to explore that.
Weight initialization: any deep neural network (with a structure similar to the one below) has "weights" associated with every connection. A weight can be viewed as how important an input feature is, with either positive or negative correlation, in producing the final output.
Of course, these weights are what a network learns over the course of its training phase. But in order to eventually update these numbers, we must start somewhere. So the question is: how do we initialize the weights of a network?
There are various ways in which one can do so, for example:
Symmetrically assigning all weights
Randomly assigning all weights
Assigning weights by sampling from a distribution
Let's look at each of the above cases and finally conclude with some techniques that are widely used to initialize weights in deep networks.
Symmetric initialization
So the first case is W^k_ij = c for all i, j, k; that is, the weights of all layers, between any two nodes, all have the same value c. Since the value is the same for all neurons, all the neurons are symmetric (between a neuron and all its subsequent connections) and will receive the same updates. We want each neuron to learn a distinct feature, and this initialization technique won't let that happen.
To demonstrate the above, see the visualization below:
Using TensorFlow Playground, initialize all weights to zero.
Let the model train.
Notice how all the weights originating from a neuron are the same.
You will see that every neuron in the first layer has a different weight value for each of its input connections, but after that, between any layer L-1 and L, all values are the same because the neurons are symmetric. The short sketch below makes the same point numerically.
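Here is a minimal NumPy sketch (a hypothetical two-layer toy network, not the playground demo itself) showing the symmetry problem: with a constant initial value, every hidden unit computes the same activation and therefore receives exactly the same gradient update.

```python
# Minimal sketch (hypothetical toy network): constant initialization keeps
# hidden neurons symmetric, so they all receive identical gradients.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 input features
y = rng.normal(size=(4, 1))        # regression targets

W1 = np.full((3, 5), 0.5)          # constant init: every weight = 0.5
W2 = np.full((5, 1), 0.5)

h = np.tanh(x @ W1)                # hidden activations (all 5 columns identical)
err = h @ W2 - y                   # dLoss/dpred for 0.5 * sum of squared errors

grad_W2 = h.T @ err                            # all 5 rows identical
grad_W1 = x.T @ ((err @ W2.T) * (1 - h ** 2))  # all 5 columns identical

print(grad_W1)
print(grad_W2)
```

Because every neuron in a layer starts from the same value and gets the same update, the layer behaves as if it had a single neuron, no matter how wide it is.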
Random initialization
Of course, assigning random values removes the symmetry problem, but there is one drawback of random weight assignment: absurdly large or small weight values lead to vanishing gradients, because at extremely large or small pre-activations a sigmoid's derivative is extremely close to zero, which hinders the weight updates at every iteration (the sketch after the demo below shows this numerically).
Let's try to visualize the same thing in TensorFlow Playground:
Assign some large positive weights and some large negative weights.
Notice how the training seems to halt even after numerous epochs.
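Here is a minimal sketch (NumPy, hypothetical weight scales) of the saturation effect: as the weights grow, the pre-activation z lands in the flat region of the sigmoid, where the derivative sigmoid(z)(1 - sigmoid(z)) is nearly zero, so the gradient, and hence the weight update, vanishes.

```python
# Minimal sketch (hypothetical values): large weights push the sigmoid into
# its saturated region, where its derivative is close to zero.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -2.0, 0.5])                  # a toy input vector

for scale in (0.1, 1.0, 10.0, 100.0):           # progressively larger weights
    w = scale * np.ones(3)
    z = w @ x                                   # pre-activation w . x
    grad = sigmoid(z) * (1 - sigmoid(z))        # derivative of sigmoid at z
    print(f"scale={scale:6.1f}  z={z:7.2f}  sigmoid'(z)={grad:.2e}")
```

The printed derivative shrinks rapidly as the weight scale grows, which is exactly why the playground run above appears to stall.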
Xavier/Glorot initialization
The need for better and improved methods led to the development of activation-function-specific weight initialization techniques. Xavier initialization is used for tanh activations, and its logic is as follows:
Xavier initialization tries to initialize the weights of the network so that the neurons' activations do not start out in saturated or dead regions. We want to initialize the weights with random values that are neither "too small" nor "too large".
NOTE: there are two variations of Xavier initialization, based on the sampling distribution used (uniform or normal).
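For reference, the two standard Glorot variants sample each weight as follows (this is the form given in the Glorot and Bengio paper, written with the "input"/"output" counts defined just below):

```latex
% Xavier/Glorot uniform
W \sim \mathcal{U}\left(-\sqrt{\tfrac{6}{\text{input} + \text{output}}},\; +\sqrt{\tfrac{6}{\text{input} + \text{output}}}\right)

% Xavier/Glorot normal
W \sim \mathcal{N}\left(0,\; \sigma^2 = \tfrac{2}{\text{input} + \text{output}}\right)
```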
where, for any neuron, input = the number of incoming connections and output = the number of outgoing connections.
But why does this work?
An interviewer can ask numerous questions about how this strange-looking formula came into existence. The answer lies in the following explanation.
Xavier weight initialization tries to ensure that the dot product w·x that is fed into tanh is neither too large nor too small.
To restrict this quantity, we can always exert some control over the x values (normalization also ensures that the data has zero mean and unit variance).
All that Xavier initialization then does is ensure that the variance of the net input, Var(net_j) = Var(w·x), is 1.
(Image reference: Quora)
In the formula above, n is the number of incoming connections; in order to also deal with outgoing connections, the research paper additionally takes the number of outgoing connections into account, which gives the final formula. The sketch below checks the core idea numerically.
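A quick numerical check of this argument (a hypothetical sketch with standardized inputs, using only the incoming-connection count n for simplicity):

```python
# Sanity check (hypothetical setup): with n incoming connections and
# standardized inputs, Var(w) = 1/n keeps Var(w . x) close to 1, whereas a
# plain unit-variance random init blows the variance up to roughly n.
import numpy as np

rng = np.random.default_rng(0)
n = 500                                          # incoming connections
x = rng.normal(0.0, 1.0, size=(10_000, n))       # zero mean, unit variance

w_naive = rng.normal(0.0, 1.0, size=n)                  # Var(w) = 1
w_xavier = rng.normal(0.0, np.sqrt(1.0 / n), size=n)    # Var(w) = 1/n

print("Var(w.x), naive :", (x @ w_naive).var())   # roughly n
print("Var(w.x), Xavier:", (x @ w_xavier).var())  # roughly 1
```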
He initialization
In He weight initialization (used for ReLU activations), we multiply the randomly initialized weights by the term below (size^[l-1] is the same as the number of incoming connections):
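The scaling term in question (a standard statement of He initialization, written here with the same size^[l-1] notation) is:

```latex
\sqrt{\frac{2}{\text{size}^{[l-1]}}}
\quad\text{i.e.}\quad
W^{[l]} \sim \mathcal{N}\!\left(0,\; \frac{2}{\text{size}^{[l-1]}}\right)
```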
As with Xavier initialization, He initialization also comes in two variants, namely He normal and He uniform.
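A minimal NumPy sketch of both variants (hypothetical layer sizes; the uniform limit sqrt(6/fan_in) is the value that gives the same variance 2/fan_in as the normal variant):

```python
# Minimal sketch (hypothetical layer sizes) of He normal and He uniform,
# both scaled by the number of incoming connections (fan_in).
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# He normal: zero-mean Gaussian with std = sqrt(2 / fan_in)
W_he_normal = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# He uniform: U(-limit, +limit) with limit = sqrt(6 / fan_in)
limit = np.sqrt(6.0 / fan_in)
W_he_uniform = rng.uniform(-limit, limit, size=(fan_in, fan_out))

print(W_he_normal.std(), W_he_uniform.std())   # both close to sqrt(2 / fan_in)
```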
A few other weight initialization techniques exist as well (note that "Kaiming initialization" is simply another name for He initialization), although I have rarely seen interview questions on them.
GloVe model: the first and foremost task of any NLP problem statement is to perform text pre-processing and eventually find some way to represent text as machine-understandable numbers. This article assumes that the reader has an idea of one such model, Word2Vec, and is hence familiar with word embeddings, semantics, and how representing words in a vector space of a certain dimension can preserve semantic information while being more space-efficient than one-hot encodings or the bag-of-words approach.
The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning.
(Quoted from the GloVe research paper.)
“Global” vectors – GloVe
Recall the Word2Vec model. The objective of W2V (say, CBOW) was to maximize the likelihood (or log-likelihood) of the probability of the main word given the context words. Here the context words were decided by a "window", and only local information was used. This was a point raised in the GloVe research paper: the "Global" in the name GloVe refers to utilizing co-occurrence information of words from the entire corpus.
Another point raised in the paper was the computational cost of using softmax in the W2V model. The paper shows how the GloVe model helps to handle this.
Let's start by understanding what a co-occurrence matrix is:
Suppose we have 3 sentences:
I enjoy flying.
I like NLP.
I like deep learning.
Considering one word to the left and one word to the right as context, we simply count the number of occurrences of other words within this context. Using the above sentences, we have 8 distinct vocabulary elements (including the full stop), and the co-occurrence matrix X can be built as shown below:
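A minimal sketch of that construction (plain Python/NumPy; tokenization is simply the whitespace-split words plus the full stop):

```python
# Build the symmetric co-occurrence matrix X for the three sentences above,
# using a context window of one word to the left and one to the right.
import numpy as np

sentences = [
    ["I", "enjoy", "flying", "."],
    ["I", "like", "NLP", "."],
    ["I", "like", "deep", "learning", "."],
]

vocab = sorted({w for s in sentences for w in s})     # 8 distinct tokens
index = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for pos, word in enumerate(sent):
        lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                X[index[word], index[sent[ctx]]] += 1

print(vocab)
print(X.astype(int))          # the 8 x 8 co-occurrence counts
```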
First we establish some notation. Let the matrix of word-word co-occurrence counts be denoted by X, whose entries X_ij tabulate the number of times word j occurs in the context of word i. Let X_i = Σ_k X_ik be the number of times any word appears in the context of word i. Finally, let P_ij = P(j|i) = X_ij / X_i be the probability that word j appears in the context of word i.
Using the above definitions, the paper demonstrates how the probability of finding a certain word in the context of a given word is computed. The following points can be noted:
Notice how all the individual probabilities are on the order of 10^-4 or lower; this can be attributed to the fact that the corpus is huge and the word-context co-occurrence matrix is sparse.
It is difficult to interpret such differences directly: as you can see, there is no clear pattern and all the numbers are small.
The solution the researchers gave was to instead use a ratio; as you can see from the third row of the table, this makes things interpretable.
If the probe word is more probable in the context of "ice", the ratio is a large number.
If it is more probable in the context of "steam", the ratio is a small number.
If the word is out of context for both "ice" and "steam", the value approaches one.
Notice how looking at the ratio gives significant information in a large corpus; the tiny sketch below computes the same quantities on the toy matrix from earlier.
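On the toy matrix this is of course only illustrative (the ice/steam numbers in the paper come from a very large corpus), but the definitions carry over directly:

```python
# Minimal sketch reusing the toy co-occurrence matrix X and vocabulary index
# from the earlier snippet: P(j|i) = X_ij / X_i, and the ratio GloVe uses.
def p(j, i):
    """Probability that word j appears in the context of word i."""
    return X[index[i], index[j]] / X[index[i]].sum()

print(p("NLP", "like"))    # 0.25 -- "NLP" does occur in the context of "like"
print(p("NLP", "enjoy"))   # 0.0  -- it never occurs next to "enjoy"

# A word common to both contexts gives a ratio close to 1:
print(p("I", "like") / p("I", "enjoy"))   # 1.0
```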
THE GLOVE OBJECTIVE FUNCTION
The function F can be any function that solves our problem (the w's are the weight vectors of the three words involved). Two further points can be noted:
The RHS is a scalar quantity, while the LHS operates on vectors; hence we can use a dot product.
In vector spaces it is the difference between vectors that carries information, hence we can write F as:
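Combining the two observations above, the chain of forms used in the paper (reproduced here in standard notation) is:

```latex
F(w_i,\, w_j,\, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
\;\;\Rightarrow\;\;
F(w_i - w_j,\, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
\;\;\Rightarrow\;\;
F\big((w_i - w_j)^{\top}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
```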
But there is one more point!
WORD OR CONTEXT?
Any word will keep interchanging its role from "word" to "context word". The final model should be invariant under this relabelling, hence we need to take care to maintain this symmetry.
To do so consistently, we must not only exchange w ↔ w̃ but also X ↔ Xᵀ.
The research paper puts it as follows:
Next, we note that Eqn. (6) would exhibit the exchange symmetry if not for the log(X_i) on the right-hand side. However, this term is independent of k so it can be absorbed into a bias b_i for w_i. Finally, adding an additional bias b̃_k for w̃_k restores the symmetry.
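Written out, the relation the paper arrives at after absorbing log(X_i) into a bias and adding a context bias is:

```latex
w_i^{\top}\tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
```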
STILL FAR AWAY?
One major drawback at this stage: suppose you have a context window of 10 words to the left and 10 to the right. Since there is no "weighting factor" involved, the 10th word is just as important as the word occurring right beside the target word. This must be taken care of:
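The fix is the weighted least-squares objective from the paper, with a weighting function f applied to every co-occurrence count:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2
```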
where V is the size of the vocabulary of the corpus. Notice how GloVe is a "regression model" (a weighted least-squares problem).
PROPERTIES OF THE WEIGHTING FUNCTION f(x)
f should be continuous
f(0)=0
f(x) should be non-decreasing, so that rare co-occurrences are not over-weighted.
f(x) should not explode or be too large for large values of x, so that frequent co-occurrences are not over-weighted.
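The weighting function the paper proposes, which satisfies all of the properties above, is:

```latex
f(x) =
\begin{cases}
\left(x / x_{\max}\right)^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```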
where α = 0.75 and x_max = 100 (chosen for experimental reasons; both can be changed).
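A minimal Python sketch of this weighting (the defaults below are just the paper's values restated):

```python
# Minimal sketch of the GloVe weighting function f(x) with the paper's
# defaults alpha = 0.75 and x_max = 100 (both tunable, as noted above).
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

print(glove_weight([0, 1, 10, 100, 5000]))
# roughly [0, 0.03, 0.18, 1, 1]: rare pairs are down-weighted, frequent ones capped
```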
PERFORMANCE COMPARISON
The research paper shows how GloVe performs against the other well-known algorithm, Word2Vec.
Complexity
The GloVe model has sublinear complexity, O(|C|^0.8), where |C| is the corpus size. Of course, this is much better than the worst-case complexity O(|V|^2).
Once you have read this blog, it should be easier to read the research paper.