The first task in any NLP problem is to pre-process the text and then find some way to represent it as numbers a machine can work with. This article assumes that the reader already knows one such model, Word2vec, and is therefore familiar with word embeddings, semantics, and how representing words in a fixed-dimensional vector space can preserve semantic information while being more space-efficient than one-hot encodings or the bag-of-words approach.
Quoted from the GloVe research paper: “The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning.”
Recall the word2vec model. The objective of w2v (let's say CBOW) is to maximize the log-likelihood of the main word given its context words. The context words are decided by a “window”, so only local information is used. This is a point raised in the GloVe research paper. “Global” in the name GloVe (Global Vectors) refers to utilising the co-occurrence information of words from the entire corpus.
Another point raised in the paper concerns the computational cost of using softmax in the w2v model. The research paper shows how the GloVe model helps to handle this.
Let's start by understanding what a co-occurrence matrix is.
Suppose we have 3 sentences:
- I enjoy flying.
- I like NLP.
- I like deep learning.
Considering one word to the left and one word to the right as context, we simply count the number of occurrences of other words within this context. Using the above sentences, we have 8 distinct vocabulary elements (including the full stop), so the co-occurrence matrix X is an 8 x 8 table of these counts. For example, the entry for (I, like) is 2, because “like” appears next to “I” in two of the sentences.
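The counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's pipeline: the function name `build_cooccurrence`, the whitespace tokenisation, and the choice to treat each sentence separately (no context across sentence boundaries) are all assumptions made for this example.

```python
# Toy corpus from the text; the full stop is written as its own token.
sentences = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]

def build_cooccurrence(sentences, window=1):
    """Count how often each word appears within `window` positions of another."""
    tokenized = [s.split() for s in sentences]

    # Vocabulary in order of first appearance.
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    index = {w: i for i, w in enumerate(vocab)}

    # X[i][j] = number of times word j occurs in the context of word i.
    X = [[0] * len(vocab) for _ in vocab]
    for tokens in tokenized:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[index[word]][index[tokens[ctx_pos]]] += 1
    return vocab, X

vocab, X = build_cooccurrence(sentences)

# Print the matrix with row and column labels.
print(" " * 10 + " ".join(f"{w:>8}" for w in vocab))
for word, row in zip(vocab, X):
    print(f"{word:>10}" + " ".join(f"{c:>8}" for c in row))
```

Because the window is symmetric, the resulting matrix is symmetric as well.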
“First we establish some notation. Let the matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$. Let $X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$. Finally, let $P_{ij} = P(j|i) = X_{ij}/X_i$ be the probability that word $j$ appear in the context of word $i$.”
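As a small sketch of these definitions, the snippet below turns a count matrix into the probabilities P(j|i) by dividing each row by its row sum. The three-word matrix and the function name `cooccurrence_probabilities` are hypothetical, chosen only to keep the example short.

```python
# Hypothetical toy counts: rows are words i, columns are context words j.
vocab = ["I", "like", "enjoy"]
X = [
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]

def cooccurrence_probabilities(X):
    """Return P with P[i][j] = X[i][j] / X_i, where X_i is the sum of row i."""
    P = []
    for row in X:
        X_i = sum(row)  # number of times any word appears in the context of word i
        P.append([x / X_i if X_i else 0.0 for x in row])
    return P

P = cooccurrence_probabilities(X)
for word, row in zip(vocab, P):
    print(word, [round(p, 3) for p in row])
```

Each row of P sums to one (for any word that has at least one context word), which is what makes it a conditional probability distribution over context words.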
Using the above definitions, the paper tabulates the probability of finding a probe word k in the context of the words ice and steam, together with the ratio P(k|ice)/P(k|steam).
Points to note:
- Notice how the individual probabilities are tiny, mostly on the order of 10^(-4) or lower. This can be attributed to the fact that the corpus is huge and the word-context co-occurrence matrix is sparse.
- It is difficult to interpret the raw probabilities: all the numbers are small and there is no obvious pattern in them.
- The solution the researchers gave was to use a ratio of probabilities instead. The ratio P(k|ice)/P(k|steam) makes things interpretable.
- If the word k is more probable in the context of ice, the ratio is a large number.
- If the word k is more probable in the context of steam, the ratio is a small number.
- If the word k is unrelated to both ice and steam, or equally related to both, the ratio approaches one.
Notice how looking at the ratio, rather than the raw probabilities, brings out meaningful information even in a large corpus.
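The behaviour of the ratio can be illustrated with a short Python sketch. The counts and totals below are invented purely for illustration and are not the numbers reported in the paper; the helper names `probability` and `ratio` are likewise assumptions.

```python
# Hypothetical co-occurrence counts, invented for illustration only.
counts = {
    "ice":   {"solid": 80, "gas": 5,  "water": 300, "fashion": 2},
    "steam": {"solid": 4,  "gas": 60, "water": 280, "fashion": 2},
}
# Hypothetical totals X_i: how many context words were seen for each word.
totals = {"ice": 100_000, "steam": 100_000}

def probability(word, probe):
    """P(probe | word) = X[word][probe] / X[word]."""
    return counts[word][probe] / totals[word]

def ratio(probe):
    """P(probe | ice) / P(probe | steam)."""
    return probability("ice", probe) / probability("steam", probe)

ratios = {probe: ratio(probe) for probe in ["solid", "gas", "water", "fashion"]}
for probe, value in ratios.items():
    print(f"{probe:>8}: {value:.3f}")
```

With these made-up counts, "solid" gives a ratio well above one, "gas" a ratio well below one, and "water" and "fashion" land close to one, matching the three cases listed above.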
THE GLOVE OBJECTIVE FUNCTION
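The paper starts from a general function F that maps three word vectors to the ratio of co-occurrence probabilities. F can be any function that solves our problem (the w's are the weight vectors of the three words involved: two words i and j, and a context word k). Two further points can be noted:
- The RHS is a scalar quantity while the LHS operates on vectors, hence we can use a dot product.
- In vector spaces it is the difference between vectors that carries information, hence F can take the difference of the two word vectors as its argument.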
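Written out in the paper's notation, the steps above look as follows. Here $w_i$ and $w_j$ are the word vectors, $\tilde{w}_k$ is the context word vector, and $P_{ik}$ is the co-occurrence probability defined earlier. The order follows the paper: first the difference of the word vectors, then the dot product.

```latex
% General form: F maps three word vectors to the probability ratio.
F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}

% The information is carried by the difference of the two word vectors.
F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}

% The right-hand side is a scalar, so the arguments are combined with a dot product.
F\left((w_i - w_j)^{T} \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}
```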
But there is one more point!
WORD OR CONTEXT?
Any word keeps switching its role between “word” and “context word”. The final model should be invariant under this relabelling, so we need to take care to maintain this symmetry.
“To do so consistently, we must not only exchange $w \leftrightarrow \tilde{w}$ but also $X \leftrightarrow X^T$.”
The research paper puts it as follows:
“Next, we note that Eqn. (6) would exhibit the exchange symmetry if not for the $\log(X_i)$ on the right-hand side. However, this term is independent of $k$ so it can be absorbed into a bias $b_i$ for $w_i$. Finally, adding an additional bias $\tilde{b}_k$ for $\tilde{w}_k$ restores the symmetry.”
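For reference, the chain of equations behind this passage can be sketched as below, using the equation numbers from the paper. Choosing $F = \exp$ turns the ratio into a difference of dot products, which leads to Eqn. (6) and then to the symmetric form with the two biases.

```latex
% Require F to turn differences into ratios (Eqn. 4):
F\left((w_i - w_j)^{T} \tilde{w}_k\right) = \frac{F(w_i^{T} \tilde{w}_k)}{F(w_j^{T} \tilde{w}_k)}

% which is satisfied by (Eqn. 5):
F(w_i^{T} \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}

% With F = exp (Eqn. 6):
w_i^{T} \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i

% Absorb \log X_i into a bias b_i and add \tilde{b}_k for symmetry (Eqn. 7):
w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
```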
STILL FAR AWAY?
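One major drawback at this stage is that nothing is weighted. Suppose you have a context window of 10 words to the left and 10 to the right: with no weighting factor involved, the 10th word is just as important as the word occurring right beside the target. Likewise, a pair of words that co-occurs only once counts as much in the objective as a pair that co-occurs thousands of times. This must be taken care of, and the paper does so by introducing a weighting function f(x), applied to the co-occurrence count x, into the cost function.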
PROPERTIES OF THE WEIGHTING FUNCTION f
- f should be continuous, with f(0) = 0, so that word pairs that never co-occur contribute nothing.
- f(x) should be non-decreasing so that rare co-occurrences are not over-weighted.
- f(x) should stay relatively small for large values of x so that frequent co-occurrences are not over-weighted.
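One function with these properties, and the one the paper uses, is f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise, with x_max = 100 and alpha = 3/4. The sketch below implements it and plugs it into a single term of the weighted least-squares cost, f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2. The function names `weight` and `glove_pair_loss` and the toy vectors are assumptions for this illustration, not part of any library.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: grows as (x / x_max)^alpha, then is capped at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

def glove_pair_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij):
    """One term of the cost: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    if x_ij == 0:
        return 0.0  # f(0) = 0, so pairs that never co-occur contribute nothing
    dot = sum(a * b for a, b in zip(w_i, w_j_ctx))
    error = dot + b_i + b_j_ctx - math.log(x_ij)
    return weight(x_ij) * error ** 2

# Rare pairs get a small weight, frequent pairs are capped at 1.
for count in [0, 1, 10, 100, 10_000]:
    print(count, round(weight(count), 4))

# Toy vectors (hypothetical values) for a single word pair seen 10 times.
print(glove_pair_loss([0.1, 0.2], [0.3, 0.4], 0.0, 0.0, 10))
```

The full cost is the sum of this term over all word pairs in the vocabulary. The cap at x_max is what keeps very frequent pairs from dominating that sum.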
The research paper also shows how GloVe performs against the other well-known algorithm, w2v.
The GloVe model has sublinear complexity, $O(|C|^{0.8})$, where $|C|$ is the corpus size. Of course, this is much better than the worst-case complexity of $O(|V|^2)$, where $|V|$ is the vocabulary size.
Once you have read this blog, the research paper should be easier to read.
Enjoy! Here is something out of “context” :p