
Transformers for beginners | What are they and how do they work

May 18, 2024
Transformers came into our lives only a couple of years ago, but they have been taking the NLP field by storm. Libraries like Hugging Face have made it very easy for everyone to use transformers, and implementations like BERT or GPT-3 are the reason everyone talks about them. But what are they and how do they work? In this video we will take a closer look at transformers and understand their operating principles. This video is part of the deep learning explainer series by AssemblyAI, a company that is creating a state-of-the-art speech-to-text API. If you want to use AssemblyAI for free, get your free API token using the link in the description.
Before transformers came along, we were using RNNs to handle text data or any other sequence data. The problem with RNNs is that when you give them a very long sentence, they tend to forget the beginning of the sentence by the time they get to the end, and because they depend on recurrence (it's in the name, recurrent neural network), they cannot be parallelized. So we started using LSTMs. LSTMs are a little more sophisticated and tend to remember information for a little longer, but they take a long time to train. And that is why we have transformers. Transformers rely only on attention mechanisms to remember things.

They do not have any recurrence, and thanks to this they are faster, because we can parallelize them and train them in parallel. Okay, but what is this attention? We could definitely make another video just about attention, and if you are interested in that, comment and let me know. In general, though, attention is the ability of a model to pay attention to the important parts of a sentence, an image, or any other type of input. If the input is a sentence, this is what it would look like. Let's say we have a sentence in English, "The agreement on the European Economic Area was signed in August 1992," and on the other side is the French translation of that sentence. I don't know any French, so I'm not even going to try to pronounce it.
As we can see in this graph, the lighter the color of a square, the more attention our model is paying to the word in that row or column. And as you can see, the attention does not always fall along the diagonal: when the model is translating "the European Economic Area", the order of the words is reversed in French, so it is paying attention in reverse order. If this were an image instead, say we are classifying different breeds of dogs, then you can see what your model is paying attention to: is it the dogs' noses, is it the dogs' ears, what exactly in the image is the model looking at in order to tell the difference between dog breeds?
Now that we have briefly looked at what attention is, let's see how transformer networks learn and what their architecture looks like. This is pretty much what a transformer network looks like, but we'll start from a high level and then break everything down to understand how the pieces work together. At a very high level, transformers have an encoder part and a decoder part; in fact, they have six encoders and six decoders, with the left side being the encoders and the right side being the decoders. Each encoder has a self-attention layer, which pays attention to the sentence itself, and a feedforward neural network. Each decoder has two attention layers and a feedforward neural network layer. The parallelization comes from how we feed the data into this network: we feed all the words in the sentence at the same time, specifically to the encoder. Within the first step, the self-attention sublayer, all the words in the sentence are compared with all the other words in the sentence.
So there is some communication between the words there, while in the next step, the feedforward neural network, each word passes through its own feedforward network, so there is no information exchange in that step. The feedforward networks are the same within the same layer, but, as we said, there are six encoders, and in each of these six encoders the neural networks are different. That was the middle part of the network; we also have the inputs and the outputs. All the raw inputs that go to the encoder or the decoder are embedded first. What are embeddings? Well, that's a slightly longer topic than this video allows (again, if you want us to make a video about it, leave a comment), but what you need to know for now is that embeddings are a way to represent these words as a vector of length n.
In this specific transformer architecture they use vectors of length 512, which is what the original paper uses, but this is a hyperparameter you can change. On top of these word embeddings we add positional encodings. If you remember, we said that transformers do not have any recurrence, so the model has no way of knowing which word comes first, which comes second, or where each word sits in the sentence. By adding a positional encoding, you are injecting some information with each word that tells the model where that word appears in the sentence. Lastly, for the output, as you can see, we have a linear layer and a softmax layer at the end of the decoders so that the output of the decoders can be transformed into something we can understand. What they produce is a vector whose length is the number of words in our vocabulary, and each cell tells us how likely it is that the word in that cell is the next word in our sequence. Those are the main components, but there are two small things that make transformers a little better. One of them is the normalization layers: if you look between the sublayers (the self-attention layers and the feedforward neural networks), we have some "add and normalize" layers, and what they do is normalize the output that comes from the sublayer. The normalization technique
used there is called layer normalization, which is basically an improvement over batch normalization. If you don't know what batch normalization is, we already made a video on that; I will link it somewhere here and you can watch it to understand batch normalization and layer normalization a little better. The second small detail is the skip connection. If you look at the image of the original architecture, you will see arrows going around the sublayers (in fact, around every sublayer): some of the information skips the self-attention sublayer or the feedforward network entirely and is sent directly to the normalization layer.
This helps the model not to forget things and helps it carry important information forward through the network. Within these normalization layers, what we do is add the information that went through the sublayer to the information that skipped the sublayer, and then normalize them together.
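Here is a minimal NumPy sketch of that add-and-norm step, assuming a toy position-wise feedforward sublayer; the function names and tiny dimensions are illustrative, not taken from any reference implementation (the paper uses 512-dimensional vectors).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each word vector (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """A toy position-wise feedforward sublayer: two linear maps with a ReLU."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

# Toy sizes: 4 words, model size 8, hidden size 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # input to the sublayer
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 8)), np.zeros(8)

# The skip connection adds the sublayer's input back to its output,
# and the sum is layer-normalized before moving on to the next sublayer.
out = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
print(out.shape)  # (4, 8)
```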
And that's all there is to the architecture of the transformer. If you look at it, most of the things inside this architecture are things we knew about before; they had existed for a long time, for example linear transformations, softmax layers, word embeddings, or feedforward neural networks. But there are two really novel ideas in the original transformer paper that made a difference for transformers: positional encodings and multi-head attention. So let's take a closer look at how they work, starting with the multi-head attention layers. If you look at the original architecture, you will see that there are two different types of multi-head attention layers: one of them is just multi-head attention, and the other is masked multi-head attention. They actually do the same thing whether they are called masked or not, and whether they are in an encoder or a decoder. The only difference is that in a normal multi-head attention layer, all the words in a sentence are compared to all the other words, while in a masked multi-head attention layer, only the words that come before a word are compared with that word (what I mean by "compare" will make more sense in a second). Inside the attention layer, something called scaled dot-product attention is used, and it is repeated multiple times to create the multi-head effect. Of course, it is all done with matrices to make things faster, but I'll show you how attention is calculated using just the word vectors.
What we have at the beginning are word embeddings. Remember that we embed the words into vectors and also add positional encodings, and then this is sent to the first encoder, specifically to the multi-head attention sublayer of the first encoder. The first thing that happens there is that these embedding vectors are multiplied by some matrices called the query, key, and value matrices. These matrices are randomly initialized and then learned during the training process, much like the weights and biases we have in neural networks. As a result of this
multiplication we obtain a query, a key, and a value vector for each word, and from this point on we use these vectors to continue the calculation. The first thing we want to do is calculate a score for each word against all the other words in the sentence. To do this, we take the dot product of the query vector of each word with the key vector of every other word. So if you want to get the score of the first word against the first word, you take the dot product of the query vector of word one with the key vector of word one.
If we want to get the score of the first word versus the second word, we take the dot product of the query vector of the first word with the key vector of the second word. So once we have taken the dot product of the first word's query vector with the key vectors of all the other words, we have the scores of the first word against all the other words; all of these scores belong to the first word. If you want the scores of the second word, you multiply its query vector with the key vectors of all the other words. That is exactly what is done, and it is all done in parallel, so there is no recurrence and we don't have to wait for earlier words to be processed before we start processing words later in the sentence.
We can do these calculations for all the words at the same time. Once we have all the scores of all the words against all the other words, we divide them by eight. That may seem like a very random number, but it is the square root of 64, which again sounds like a number that comes out of nowhere; in fact, 64 is the length of the query, key, and value vectors, and that is why the authors of the original transformer paper use it. After dividing everything by 8, we pass all these values through a softmax layer. We do this to normalize them, so the scores of a word against all the other words now add up to one. The resulting numbers serve as a sort of weight. From this point, what we do is multiply the value vectors of all the words by these weights, and finally we add up all the weighted value vectors to create the output of the self-attention layer for this word. Then you do all these calculations for all the other words (this is done simultaneously), and in the end you have the result of the attention layer.
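As a rough illustration, here is a minimal NumPy sketch of that scaled dot-product attention calculation for a single head. The matrix names (W_q, W_k, W_v) and the tiny dimensions are illustrative assumptions; in the original paper the query, key, and value vectors have length 64, which is where the division by 8 comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, W_q, W_k, W_v, mask=None):
    """Single-head attention over the word vectors in x (one row per word)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # query, key, and value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # dot products divided by sqrt(d_k)
    if mask is not None:                         # masked variant: hide future words
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)           # each row now sums to one
    return weights @ V                           # weighted sum of the value vectors

# Toy example: 5 words, embeddings of size 8, queries/keys/values of size 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # word embeddings + positional encodings
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))

out = scaled_dot_product_attention(x, W_q, W_k, W_v)
print(out.shape)  # (5, 4): one attention output per word

# A lower-triangular mask gives the "masked" attention used in the decoder.
causal = np.tril(np.ones((5, 5), dtype=bool))
masked_out = scaled_dot_product_attention(x, W_q, W_k, W_v, mask=causal)
```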
As I mentioned, multi-head attention does all of this eight times. So effectively, you are training eight different sets of query, key, and value matrices: not eight sets of word vectors, but eight sets of the matrices by which we multiply the input embeddings. This way, the model can pay attention not just to one other word but to many other words in the sentence. In the paper they use the number eight, but you can change it if you want. If we look again at the example from the beginning of this video, you can see that some of the cells are bright, but at the same time some cells are a little bit gray, and that means our model was also paying a bit of attention to those other words besides the main word. This multi-head attention is also one of the reasons why transformers are so good at handling sentences of different lengths.
One thing you might notice is that if we do the same thing eight times, we end up with eight different resulting matrices, where one row corresponds to one word, but we have eight of them. So how do we deal with this? What they do in the paper, or what they propose, is to concatenate them all together and then multiply the result by another weight matrix, which produces a matrix that looks like a single output of the attention layer.
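A minimal sketch of that concatenation step, assuming the eight per-head outputs have already been computed (random stand-ins here); W_O is a stand-in name for the extra output weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, d_head, d_model = 8, 5, 64, 512

# Stand-ins for the eight matrices coming out of the eight attention heads:
# one row per word, d_head columns each.
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]

# Concatenate the heads side by side, then project with the output weight
# matrix so the result again looks like a single attention-layer output.
concatenated = np.concatenate(head_outputs, axis=-1)   # (5, 8 * 64) = (5, 512)
W_O = rng.normal(size=(num_heads * d_head, d_model))   # trained like W_q, W_k, W_v
multi_head_output = concatenated @ W_O                 # (5, 512), one row per word
print(multi_head_output.shape)
```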
This weight matrix is, of course, just another thing to train inside the transformer, in addition to the query, key, and value matrices by which we multiply all the word embeddings. Next up is positional encoding. As I mentioned before, positional encodings are a way to inject information into the word embeddings we created earlier to show where a word is in a sentence, basically the location information of a word. You can use either learned positional encodings or fixed positional encodings; in the original paper they recommend fixed positional encodings, because these have the advantage of being able to handle sentence lengths that were never seen in the training set. You might say: why do we need a fancy solution for this anyway?
Why can't we just assign each word a number that specifies where in the sentence it is? That doesn't actually work. Say you assign a number that goes from zero to one: you won't really know how many words are in the sentence just by looking at this value, and the value will not be consistent between examples. Another option could be to assign integers to the words, starting from zero or one and counting up to however long the sentence is, but the problem there is that those numbers can get very high for very long sentences and could get out of control. On top of that, there could be sentences of specific lengths that you don't have in the training data, and that could cause problems in terms of generalization. So the solution to this positional problem in the original transformer paper was to use sine and cosine functions at different frequencies. Of course, I don't expect you to just know what that means, so let's see what it looks like. This is what the sine and cosine functions look like at different frequencies.
The colors here show numbers that vary from -1 to 1. The x-axis shows the length of the word embeddings (in this transformer we are using 512, as I mentioned before), and the y-axis is the position of the token, of the word. So if I want to get the positional encoding of a word that is, say, at position 20, I take the horizontal line that corresponds to 20 on the y-axis. The nice thing about this positional encoding is that it is unique: no other horizontal line in this graph has the same composition of values. Another good thing about these positional encodings is that the relationship between the encodings of two positions stays consistent no matter where in the sentence they are. What really helped me understand this concept was looking at binary representations of integers. If you look at these examples, you'll notice that as you count up, the smallest digit in the binary representation flips between one and zero with every new integer, while the second digit changes every two integers: first it is zero and zero, for the next two integers it is one and one, for the next two it is zero and zero again, and this pattern keeps repeating.
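A couple of lines of Python are enough to see the pattern being described here:

```python
# The lowest bit flips with every integer, the next bit every two integers,
# the next every four, and so on; every integer gets a unique pattern.
for i in range(8):
    print(i, format(i, "04b"))
# 0 0000, 1 0001, 2 0010, 3 0011, 4 0100, 5 0101, 6 0110, 7 0111
```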
They are unique, no two binary representations are the same, and you can always tell the difference between two integers by looking at their binary representations. This could also be a perfectly useful positional encoding for us, but it's just ones and zeros, and we're not actually using the information that can be provided with continuous values, which is why we use sine and cosine functions. Well, what do we do once we have these encodings correct? Let's say we have this encoding of 512 values ​​that we extracted from this graph. What we do is we basically add them together, we add the word embedding and the positional encoding and then we pass it to the encoders, so we learn everything we need about the architecture.
So now we have learned everything we need about the architecture. There are encoders, specifically six of them, and there are decoders, again six of them. We have the final processing on the output, and on the input we have the embeddings plus the positional encodings. But how does it all work together? Basically, what happens is that you take your inputs, run them through the embeddings, run them through the positional encodings, and then run them through six layers of encoders, and you get an output. This output is sent to all the decoders; as we mentioned, there are six decoder layers. The information from the output of the encoder is sent to all the decoders, but only to their second sublayer, the multi-head attention sublayer. The first, masked multi-head attention sublayer of the decoders gets as its input what was generated by the decoder part of the model in the previous time step. That way, the decoders take into consideration both the word that was produced at the previous position and the context that was learned during the encoding process.
To create the output, all these decoders work together and produce an output vector, and this output vector is sent through a linear transformation, which creates a logit vector. This logit vector is as long as the number of words in our vocabulary, and it holds scores for how likely each word is to be the next word in our sequence. We then pass this through a softmax so that we get a probability for each word, and these probabilities add up to one; it is basically a normalized version of the logit vector. The output of the softmax layer tells us what the next word will be.
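A minimal sketch of that last step, with a toy vocabulary and a random vector standing in for the decoder output; the linear layer's weights here are illustrative stand-ins for what the model actually learns.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "agreement", "was", "signed", "in", "august", "<eos>"]
d_model = 512

decoder_output = rng.normal(size=(d_model,))        # vector from the decoder stack
W_linear = rng.normal(size=(d_model, len(vocab)))   # the final linear transformation

logits = decoder_output @ W_linear                  # one score per vocabulary word
probs = softmax(logits)                             # probabilities that sum to one
print(vocab[int(np.argmax(probs))])                 # the most likely next word
```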
And that's all there is to know about transformers. It's actually quite simple, although at first it seems a little complicated. All you need to know is that there are encoders and decoders, and that the two novel ideas that came into our lives with transformers are positional encodings and the multi-head attention layer. To fully understand transformers and how they work, you may need to watch this video several times and perhaps support your learning with some of the written resources out there, which is why I left links to my favorite resources in the description. If anything was unclear or if you have any questions, please leave a comment and let me know. If you liked this video,
don't forget to give us a thumbs up, and maybe even subscribe to be one of the first people to know when we make a new video. But before you go, don't forget to get your free token for the AssemblyAI speech-to-text API. See you in the next video!
