Intuitive Maths and Code behind Self-Attention Mechanism of Transformers for dummies

This blog post gets into the nitty-gritty details of the attention mechanism and builds one from scratch in Python. The code and the intuitive maths explanation go hand in hand.

What are we going to Learn?

  1. Attention Mechanism concept
  2. Steps involved in Self Attention Mechanism (Intuitive mathematical theory and code)
    • Input Pre-Processing
    • Role of Query, Key, and Value matrix
    • Concept of Scaled Attention Scores
  3. Multi-Head Attention Mechanism

As discussed in the previous post, let's look at what happens when a sentence passes through an attention mechanism. Say we have the sentence "The elephant ate the banana as it was hungry". The attention mechanism creates a representation (embedding) of each word by keeping in mind how each word is related to every other word in the sentence. In the sentence above, the attention mechanism understands the sentence well enough to relate the word "it" with "elephant" and not with "banana".


Steps involved in Self Attention Mechanism

1. Get the input in the proper format :-

We all know by now that raw text is not a suitable input for a Transformer (or any computer program) to interpret. Hence we represent each word in a text with a vector of numbers. Let's create embeddings for the sentence "This is book" and assume the embedding dimension to be 5, so for each word we have a vector of length 5 as shown below.
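Here is a minimal sketch of this step. The embedding values below are made-up placeholders; in practice they would come from a trained embedding layer:

```python
import numpy as np

# Toy embeddings for the sentence "This is book":
# 3 words, embedding dimension 5. The values are made up for illustration.
X = np.array([
    [1.0, 0.2, 0.1, 0.3, 0.5],  # "This"
    [0.4, 1.1, 0.9, 0.2, 0.3],  # "is"
    [0.7, 0.1, 1.2, 0.8, 0.4],  # "book"
])
print(X.shape)  # (3, 5)
```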

From the input matrix above we will create a couple of new matrices, namely the Query, Key, and Value matrices. These matrices play a vital role in the attention mechanism. Let's see how.

2. Obtaining the Query, Key, and Value matrices

First, we need the Query, Key, and Value weight matrices. For now, we initialize them randomly, but in actuality, like any other weights in a neural network, these are parameters learned during the training process, and the optimal weights are what get used in the end. Assume the weights shown in the code are those optimal weights. What we will be doing in the code section is summarized in the diagram below:-

These weight matrices are then multiplied by our input matrix (X), and that gives us our final Query, Key, and Value matrices.
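A sketch of this step, continuing with the toy input matrix X from above. The random weights and the chosen vector size d_k = 4 are stand-ins for the learned parameters:

```python
np.random.seed(42)

embed_dim = 5  # size of the input word embeddings
d_k = 4        # size of the query/key/value vectors (an arbitrary choice here)

# Weight matrices, initialized randomly; in a real model these are learned.
W_query = np.random.rand(embed_dim, d_k)
W_key   = np.random.rand(embed_dim, d_k)
W_value = np.random.rand(embed_dim, d_k)

# Multiply the input matrix X by each weight matrix.
# Row i of each result is the query/key/value vector of word i.
queries = X @ W_query  # shape (3, 4)
keys    = X @ W_key    # shape (3, 4)
values  = X @ W_value  # shape (3, 4)
```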

The first row in the Query, Key, and Value matrices holds the query, key, and value vectors of the word "This", and so on for the other words. Until now, the query, key, and value matrices might not make much sense. Let's see how the self-attention mechanism creates a representation (embedding) of each word by finding how each word is related to the other words in the sentence, using these query, key, and value vectors.

3. Scaled Attention Scores

The formula for the scaled attention score is :-

Attention(Q, K, V) = softmax( (Q . K^T) / sqrt(d_k) ) . V

where d_k is the dimension of the key vectors.

Q.K^T is a dot product between the query and key matrices, and a dot product measures similarity, as shown in the image below.

Note:- The numbers in the image below are all made up for the sake of explanation and do not add up exactly.

So there is a dot product between the query vector q1 ("This") and all the key vectors k1 ("This"), k2 ("is"), k3 ("book"). This computation tells us how similar the query vector q1 ("This") is to each vector in the key matrix. If we look at the final score matrix, we can see that each word is related to itself more than to any other word in the sentence, as shown by the large values along the diagonal: the dot product of a vector with itself is high. Second to that, the word "This" is most related to "book", as highlighted in red in the image above. As seen in the code below, we also divide Q.K^T by sqrt(d_k). This is a normalization step that keeps the dot products from growing too large and helps keep the gradients stable.
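Continuing the sketch with the variables defined earlier, the scaled scores take one line of NumPy:

```python
# Dot product of every query with every key, scaled by sqrt(d_k).
# Entry [i][j] says how strongly word i attends to word j.
scaled_attention_scores = (queries @ keys.T) / np.sqrt(d_k)
print(scaled_attention_scores.shape)  # (3, 3)
```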

Softmax in the code below brings the scores into the range 0 to 1 and turns each row into probability values.
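A simple row-wise softmax, sketched here from scratch rather than taken from a library:

```python
def softmax(x):
    # Subtract the row-wise maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

softmax_attention_scores = softmax(scaled_attention_scores)
# Each row now sums to 1 and can be read as attention probabilities.
```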

The matrix above is an intermediate softmax-scaled attention score matrix, where each row holds the attention/probability scores for one word in the sequence. It shows how each word is related to every other word in terms of probability. To get the final attention vector for a word, we weight the value vectors by these scores and sum them up; for the word "This", the three weighted value vectors are summed.

In the code snippet below, softmax_attention_scores[0][0] is the weight given to a particular word and values[0] is the value vector corresponding to the word "This", and so on.
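Sketching that computation with the variables from above:

```python
# Attention vector for the first word, "This": weight each value vector
# by its softmax score and sum the three weighted vectors.
attention_this = (softmax_attention_scores[0][0] * values[0]
                  + softmax_attention_scores[0][1] * values[1]
                  + softmax_attention_scores[0][2] * values[2])

# The same computation for all words at once is a single matrix multiply.
attention_output = softmax_attention_scores @ values  # shape (3, 4)
```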

Similarly, we can calculate the attention vectors for the other words, "is" and "book". This is the mechanism of self-attention. Next we will look into the multi-head attention mechanism, whose underlying principle comes from the self-attention mechanism.

Multi-Head Attention Mechanism:-

In simple words, the multi-head attention mechanism is nothing but multiple self-attention mechanisms running in parallel, with their outputs concatenated together. If we call each self-attention flow/process a head, then we get multi-head attention by concatenating the outputs of all the heads.

When we do the hands-on in an upcoming blog post, we will see that the output of each encoder has a dimension of 512 and that there are a total of 8 heads. Each self-attention module is set up so that it outputs a matrix of dimension (no_of_words_in_sentence, 64). When the outputs of all heads are concatenated, the final matrix has dimension (no_of_words_in_sentence, 64*8 = 512). The last step is to multiply the concatenated heads by a weight matrix (learned during training), and that gives the output of our multi-head attention.
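Below is a minimal sketch of the whole idea with the dimensions mentioned above (8 heads of size 64 concatenated back to 512). The helper function, variable names, and random weights are illustrative assumptions, not code from a real library:

```python
import numpy as np

seq_len, d_model, num_heads = 3, 512, 8
d_head = d_model // num_heads  # 64

np.random.seed(0)
X = np.random.rand(seq_len, d_model)      # encoder input (made-up values)
W_out = np.random.rand(d_model, d_model)  # final projection, learned in practice

def self_attention_head(X):
    # Each head has its own (learned) projections; random here for illustration.
    W_q, W_k, W_v = (np.random.rand(d_model, d_head) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # shape (seq_len, 64)

# Run all 8 heads, concatenate their outputs, then apply the final projection.
heads = [self_attention_head(X) for _ in range(num_heads)]
multi_head_output = np.concatenate(heads, axis=-1) @ W_out  # (seq_len, 512)
```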

In our next blog post we will discuss the Hugging Face implementation of Transformers. Until then, goodbye! If you found this helpful, feel free to share it with your friends.
