In this article we will try to understand the complex “transformer” architecture. There are a lot of components involved in the structure, and I hope you are familiar with concepts like attention, embeddings, RNNs, and the problems associated with them.
The problem people encounter when trying to understand something with too many components is that they lose track of the objective, and it becomes just random matrix multiplication going through their head.
Once you get the hang of transformers, architectures like BERT and GPT (2, 3) become easier to understand. And when you say that you “understand” any ML model, you should be confident enough to code it from scratch.
That said, let’s protect ourselves from the problem of getting lost in details by clearly stating the objectives. We need to learn the following:
- Why are RNNs slow?
- Do “bidirectional” LSTMs really capture context?
- How do transformers take all inputs at once while still preserving “sequence”? (positional encoding)
- What are queries, keys, and values? What is their physical significance, i.e. what do they mean to a layman?
- Attention vs. self-attention.
- The attention filter.
- What is masked attention?
- Why do we need multi-head attention in transformers?
- Training and the teacher-forcing mechanism.
The general architecture of transformers revolves around the questions above. So whenever you feel you are losing track of where things are going, just check back here.
Here’s how a transformer looks (on the left is one encoder block and on the right is one decoder block).
Let’s list out the components one by one and briefly see what they are before diving deep into things like queries, keys, values, and self-attention.
- Inputs: the raw text data you want to process (for example, for translation, dialogue, etc.).
- Input embeddings: a computer processes numbers, not text. Hence the raw input data is transformed into vector-space embeddings that preserve semantic relations: the closer two words are in the transformed vector space, the more similar the context they share.
- Positional encoding: RNNs are slow because all the words are processed one by one; moreover, they fail to preserve information over long “distances”. In contrast, a transformer is fed all the input word embeddings at once. Now how would such an “all at once” mechanism preserve the sequence information? The answer is positional encoding, which helps the transformer learn the order of any sequence.
THE ENCODER BLOCK
The encoder block has the following components:
- Multi-head attention: the multi-head mechanism takes the input word embeddings and uses (trainable) query, key, and value matrices to produce its output. Each “head” is responsible for learning to pay attention to different features of a sentence, like prepositions, verbs, etc.
- Residual connection: the residual connection is used to preserve important information across layers.
- Add & Norm: the residual and multi-head attention outputs are added and normalized before being passed to a feed-forward layer.
- Feed-forward layer: the usual feed-forward layer consisting of neurons and trainable weights. (You can see one more residual connection and Add & Norm layer here too; they serve the same purpose.)
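The components above can be sketched as a single forward pass. Below is a minimal numpy sketch, assuming layer normalization for the Add & Norm steps and a simple two-layer ReLU feed-forward network; the attention sublayer is passed in as a function so the sketch stays focused on the residual + Add & Norm wiring.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, self_attention, feed_forward):
    """One encoder block: sublayer -> residual add -> norm, twice."""
    x = layer_norm(x + self_attention(x))   # multi-head attention + Add & Norm
    x = layer_norm(x + feed_forward(x))     # feed-forward + Add & Norm
    return x

# stand-in sublayers: identity "attention" and a random-weight ReLU network
rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 256), scale=0.1)
W2 = rng.normal(size=(256, 64), scale=0.1)
ffn = lambda x: np.maximum(0, x @ W1) @ W2

out = encoder_block(rng.normal(size=(10, 64)),
                    self_attention=lambda x: x,
                    feed_forward=ffn)
print(out.shape)  # (10, 64)
```

In a real transformer the `self_attention` argument would be the multi-head attention discussed below, and the whole block would be stacked 6 times.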
The research paper (https://arxiv.org/pdf/1706.03762.pdf) uses 6 such encoder layers and correspondingly 6 decoder layers. There is nothing sacred about the number “6”, and one can experiment with it, as with any hyperparameter.
THE DECODER BLOCK
The overall structure is not very different from the encoder block; the differences are:
- In every decoder block’s second (cross-)attention layer, the keys and values come from the final encoder block, while the queries come from the decoder itself.
- The multi-head attention mechanism has a new feature called “masking”.
- Finally, the outputs are passed through a linear layer and then a softmax layer, which predicts which word is most likely to be the next output (in the case of translation). The dimension of the softmax layer equals the vocabulary size of the target language.
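That final projection is just a matrix multiply followed by a softmax over the target vocabulary. Here is a small numpy sketch; the tiny 5-word vocabulary and the random weight matrix are toy placeholders, not real trained values.

```python
import numpy as np

def predict_next_word(decoder_output, W_vocab, vocab):
    """Project the decoder output to vocabulary size, then softmax."""
    logits = decoder_output @ W_vocab          # (d_model,) -> (vocab_size,)
    probs = np.exp(logits - logits.max())      # stable softmax
    probs /= probs.sum()
    return vocab[int(np.argmax(probs))], probs

vocab = ["le", "chat", "est", "noir", "<eos>"]  # toy target vocabulary
rng = np.random.default_rng(0)
W_vocab = rng.normal(size=(16, len(vocab)))     # d_model = 16 here
word, probs = predict_next_word(rng.normal(size=16), W_vocab, vocab)
print(word, probs.sum())  # probabilities sum to 1
```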
Let’s start discussing the important parts:
Positional encoding is done in transformers to prevent the sequential information from getting lost (because all the embeddings are provided at once, not in time steps). The formula for creating positional encodings is:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Let’s understand the terms one by one. Suppose your input embedding has dimension d_model and its position in the sentence is pos. Now you want the value of the i-th dimension (i < d_model). Then, depending on whether i is odd or even, the cosine or sine formula applies.
Below is an image of the positional encodings, obtained from TensorFlow.
Point to note: even though you may think that, since the functions are sinusoidal, there will be periodic repetition, observe carefully that the presence of “i” takes care of this: each dimension uses a different frequency, so the encoding changes across positions.
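A minimal numpy sketch of the sinusoidal formula above (the shapes `max_len=50`, `d_model=128` are just example values):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angle = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)             # even dimensions
    pe[:, 1::2] = np.cos(angle)             # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

These values are simply added to the input embeddings before the first encoder block, so every token carries its position with it.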
“Bidirectional LSTMs and GRUs are basically the outputs of two unidirectional LSTMs concatenated together, so they aren’t all that ‘bidirectional’. On the other hand, given enough resources (computational power), attention can theoretically look at an infinitely long context window.”
The basic difference between attention and self-attention is that the attention mechanism we discussed in the previous post tried to tell which words of a sentence are more important with reference to a given word, whereas in self-attention we learn how the words of a sentence relate to the other words of the same sentence. See the following to get an understanding:
BUT HOW DO WE GET THESE ATTENTION VALUES ?
Look at the operation displayed below. You can see three matrices: Q (query), K (key), and V (value). The position-encoded inputs are passed through linear layers whose weights are these matrices.
The resulting scores are then divided by the square root of d_k and passed through a softmax layer.
The output of this softmax layer is called the “attention filter” for this key.
Notice that the value matrix output does not undergo any other preprocessing, so it is basically an encoded version of the original input. Now when we multiply the attention filter with these values (in the MatMul block), what we get is, physically, “the attention applied to the original values”: it tells us which values we are focusing on. For example, a certain filter might focus its attention on nouns.
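The whole pipeline just described, Q·Kᵀ, scaling by √d_k, softmax, then multiplying the filter with V, fits in a few lines of numpy. The random inputs and weight matrices below are stand-ins for real position-encoded embeddings and trained projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq) similarity scores
    attention_filter = softmax(scores)       # each row sums to 1
    return attention_filter @ V, attention_filter

# toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # position-encoded inputs
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, filt = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, filt.shape)  # (4, 8) (4, 4)
```

Each row of `filt` is one token’s attention filter: how much of every other token’s value it mixes into its own output.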
Different attention “heads” try to learn different features to consider while learning attention. Just as different filters in a CNN learn different features like eyes, noses, edges, and hair, here different query/key projections can learn different relations like prepositions, verbs, names, and other grammatical patterns.
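Running several such heads in parallel and concatenating their outputs is the “multi-head” part. Below is a minimal sketch; random matrices stand in for the learned per-head Wq/Wk/Wv projections and the final output projection Wo:

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Multi-head self-attention sketch: each head attends in its own
    d_model/num_heads subspace, then the heads are concatenated."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(42)
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head), scale=0.1)
                      for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)                      # (seq_len, d_head)
    Wo = rng.normal(size=(d_model, d_model), scale=0.1)
    return np.concatenate(heads, axis=-1) @ Wo         # (seq_len, d_model)

out = multi_head_attention(np.ones((5, 64)), num_heads=8)
print(out.shape)  # (5, 64)
```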
After understanding this, you can have a look at this 3-D visualization of multi-head attention:
Image taken from: https://youtu.be/-9vVhYEXeyQ
In the decoder block, while training, we don’t want the decoder to pay attention to future words, so we “mask” the future values in the attention filter beyond the current time step.
Hence at any moment the decoder has key and value weights from the last encoder, and masked self-attention up to the present decoder step.
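In practice the masking is done by setting the future positions of the score matrix to a large negative number before the softmax, so they come out of the softmax as (essentially) zero. A small numpy sketch:

```python
import numpy as np

seq_len = 5
# causal mask: position i may only attend to positions <= i
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
scores[mask] = -1e9                  # "mask" future positions before softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# upper triangle is ~0: no attention to future tokens
```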
The training method we use here is “teacher forcing”. Transformers are slow to train (attention’s time complexity is quadratic in the sequence length), so it is often better to use pre-trained architectures like BERT to best utilize transformers.
It is similar to what you did when training the attention model in neural translation, and during backpropagation we learn the query, key, and value weights.
I think I have answered all the questions. Feel free to explore more resources, especially the source code at TensorFlow.
In a future post we shall discuss the BERT model.