Understanding Word Embeddings, Semantics, and Word2vec Model Methods: Skip-gram and CBOW
Remember: machines can only operate on numbers. Approaches like bag-of-words and TF-IDF provide a solution for converting text to numbers, but not without drawbacks. Here are a few:
- As the dimensionality (vocab size) increases, the representation becomes memory inefficient.
- Since all the word representations are orthogonal, there is no semantic relation between words. That is, the amount of separation between words like “apple” and “orange” is the same as between “apple” and “door”.
Therefore the semantic relation between words, which is really valuable information, is lost. We need techniques that, along with converting text to numbers, are space efficient and also preserve the semantic relations among words.
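To see the orthogonality problem concretely, here is a minimal sketch (the three-word toy vocabulary is made up for illustration):

```python
import numpy as np

# Toy vocabulary; each word becomes a one-hot vector of length vocab_size.
vocab = ["apple", "orange", "door"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Every distinct pair of one-hot vectors has cosine similarity 0 --
# "apple" is exactly as far from "orange" as it is from "door".
print(cosine(one_hot["apple"], one_hot["orange"]))  # 0.0
print(cosine(one_hot["apple"], one_hot["door"]))    # 0.0
```

No similarity structure survives the encoding, which is exactly the problem embeddings are meant to fix.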
Embeddings are used in many deep learning NLP tasks, such as seq2seq models and Transformers.
Before understanding how it does the job we discussed above, let's see what it is actually composed of.
Word2vec is a two-layer neural network that generates word embeddings of a chosen dimension from a given text corpus.
So essentially, word2vec helps us create a numerical mapping of words into a vector space; remember, this dimensionality can be much smaller than the vocab size.
So what does the word2vec model try to do?
It tries to produce similar embeddings for words that occur in similar contexts.
It tries to achieve this using two algorithms:
- CBOW (continuous bag of words)
- Skip-gram
CBOW: a model that tries to predict the target word from the given context words.
Skip-gram: a model that tries to predict the context words from the given target word.
Here is how we define the objective (the loss is its negative log-likelihood): for skip-gram we maximize p(wt+j | wt), the probability of a context word given the target word; for CBOW we maximize p(wt | wt+j), the target word given its context.
Let's break it down. Suppose we define “context” to be a window of m words around the target. We iterate over all such windows from the beginning of our sentence and, at each position, try to maximize the conditional probability of a context word occurring given the target word, or vice versa.
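As a concrete illustration of this window iteration, here is a small sketch that generates skip-gram (target, context) training pairs; the sentence and window size are made-up examples:

```python
def skipgram_pairs(tokens, m):
    """Yield (target, context) pairs using a window of m words on each side."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the target itself and positions outside the sentence.
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence, 1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'),
#  ('brown', 'fox'), ('fox', 'brown'), ('fox', 'jumps'), ('jumps', 'fox')]
```

For CBOW, the same pairs are simply read the other way around: all context words in the window jointly predict the target.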
- Let's say your vocab size is V.
- We perform one-hot encoding for all words.
- We choose our window size as C.
- We decide the embedding dimension, say N. Our hidden layer will have N neurons.
- The output is a softmax layer of dimension V.
- With these pieces we create the architecture below.
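The architecture above can be sketched in NumPy as a single forward pass. This is a shapes-only illustration with randomly initialized (untrained) weights and made-up toy values for V and N, not a training loop:

```python
import numpy as np

V, N = 10, 4          # vocab size and embedding dimension (toy values)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))   # input -> hidden weights (V x N)
W_out = rng.normal(size=(N, V))  # hidden -> output weights (N x V)

def forward(word_idx):
    """One skip-gram forward pass: one-hot input -> hidden -> softmax over V."""
    x = np.zeros(V)
    x[word_idx] = 1.0            # one-hot encoding of the input word
    h = x @ W_in                 # hidden layer: just row word_idx of W_in
    scores = h @ W_out           # one score per vocabulary word
    e = np.exp(scores - scores.max())
    return e / e.sum()           # softmax: a probability over all V words

probs = forward(3)
# probs has shape (V,) and sums to 1 -- a distribution over context words.
```

Training would adjust W_in and W_out so that the probability mass lands on the words actually observed in the window.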
We perform the training and update the weights. After training, the weight matrix between the input layer and the hidden layer is our word embedding matrix. Note its dimension is (vocab size) × (our selected lower dimension N). Hence its rows contain the N-dimensional vector-space representations of all the words in the vocabulary.
The working of skip-gram is just the opposite: there we use one word to predict the context words.
Now the final step: to obtain the word vector of any word, we take the embedding matrix that we trained and multiply the word's one-hot representation by it.
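This lookup can be sketched as follows; `E` here is a stand-in for a trained V × N embedding matrix (the values are made up so the result is easy to check):

```python
import numpy as np

V, N = 5, 3
# Stand-in for a trained embedding matrix (V x N); values are arbitrary.
E = np.arange(V * N, dtype=float).reshape(V, N)

word_idx = 2
one_hot = np.zeros(V)
one_hot[word_idx] = 1.0

# Multiplying the one-hot vector by E just selects row word_idx of E.
vec = one_hot @ E
print(vec)  # [6. 7. 8.]
```

This is also why real implementations skip the matrix multiplication entirely and just index the row, e.g. `E[word_idx]`.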
It's always helpful to remember the objective function mentioned above in interviews!