In RNNs and LSTMs, words are fed in sequence, so the model naturally understands word order. However, this recurrence requires more and more operations as the sentence grows longer. A transformer, in contrast, processes all the words in parallel, which reduces training time. To preserve the order of words, the concept of positional encodings is introduced: an encoding that denotes the position of each word. In simple terms, we add the positional encodings to the existing word embeddings, and the result is the final pre-processed embedding that is fed into the encoder.
Different Techniques of Positional Embeddings:-
1. Positional embedding = index of the word
In this case, if the length of the sentence is 30, then the index of each word can serve as its positional embedding: the first word gets 0, the second gets 1, and so on up to 29.
We could use this way of encoding, but the problem is that as the sentence length increases, the large positional values dominate the original word embedding and distort it. So we discard this method for our natural language processing task.
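The idea above can be sketched in a few lines of NumPy (the sentence and values here are illustrative, not from the post):

```python
import numpy as np

# Sketch: use the raw word index itself as the positional encoding.
sentence = "the quick brown fox jumps".split()
pos_encoding = np.arange(len(sentence), dtype=float)
print(pos_encoding)  # [0. 1. 2. 3. 4.]
# For a 1000-word text the values reach 999, swamping typical
# word-embedding magnitudes — the problem described above.
```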
2. Positional embedding = fraction of sentence length
If we instead express the position as a fraction of the sentence length, i.e. pos/N where N is the number of words, the values stay bounded between 0 and 1. The loophole here is that when we compare two sentences of different lengths, the positional embedding value at a particular index will differ. Ideally, a positional embedding should have the same value at a given index regardless of sentence length; otherwise it will distort the model's understanding. So we discard this method too, and go for the frequency-based positional encoding used in the original paper “Attention Is All You Need”.
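A quick sketch makes the loophole concrete (the helper name and sentence lengths are illustrative):

```python
# Sketch: encode each position as a fraction of sentence length, pos / N.
def fractional_encoding(n_words):
    return [pos / n_words for pos in range(n_words)]

print(fractional_encoding(4))  # [0.0, 0.25, 0.5, 0.75]
print(fractional_encoding(8))
# Index 2 encodes to 0.5 in a 4-word sentence but to 0.25 in an
# 8-word one — the same position gets different values.
```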
3. Frequency-based positional embeddings
The authors of the paper came up with a unique idea: using wave frequencies to capture positional information.
The encoding is defined as PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). Here pos is the position of the word (pos = 0 for the first word); d is the size of the positional embedding, which must equal the dimension of the existing word embedding; and i indexes the dimensions of the positional embedding. The value of i also determines the frequency, with i = 0 being the highest frequency.
In the first sine-curve diagram (where i = 4), the sine curve is plotted against different values of position, where the position denotes the position of the word. Since the height of the sine curve depends on the position on the x-axis, we can use that height to encode word positions. And since the height varies within a fixed range and does not depend on text length, this method overcomes the limitation discussed above.
The height of the sine curve is bounded between -1 and 1, and the positional encoding values remain the same as the sentence grows longer. But in the smooth sine curve below (where i = 4), the y-axis distance between word position 0 and word position 6 is very small. To overcome this, we increase the frequency (the number of cycles completed per unit of position). If we do that, then in the first sine curve above (where i = 0), the distance between position 0 and position 6 becomes clearly visible.
The authors used a combination of sine and cosine functions to get these embeddings: sine for the even dimensions and cosine for the odd ones.
Let’s code this
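Here is a minimal NumPy sketch of the sinusoidal encoding from the paper (the original post's own code is not reproduced here; function and variable names are my own):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    """
    pos = np.arange(max_len)[:, np.newaxis]      # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]  # even dims: 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe

pe = positional_encoding(max_len=10, d_model=16)
print(pe.shape)  # (10, 16)
print(pe[0])     # first position: alternating 0 (sin 0) and 1 (cos 0)
```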
A preview of the output looks like this:-
Here we can see that the 1st and 2nd words are close to each other, so their cosine similarity is high, while the 1st and 9th words are far apart, so their cosine similarity is low.
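This nearness effect can be checked directly by comparing positional vectors with cosine similarity (a self-contained sketch; the setup and names are mine, not the post's):

```python
import numpy as np

# Build sinusoidal positional encodings for 10 positions, 16 dimensions.
d_model, max_len = 16, 10
pos = np.arange(max_len)[:, None]
two_i = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, two_i / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(pe[0], pe[1]))  # neighbouring positions: similarity close to 1
print(cos_sim(pe[0], pe[8]))  # distant positions: noticeably lower
```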
So that’s it on positional encodings. If you liked it, feel free to share it with your friends. Until then,