The math behind Attention: Keys, Queries, and Values matrices

May 06, 2024
Hello, my name is Luis Serrano, and in this video you will learn about the mathematics behind attention mechanisms in large language models. Attention mechanisms are really important in large language models; in fact, they are one of the key steps that make Transformers work so well. In a previous video I showed you how attention works at a very high level, with words flying towards each other and gravitating towards other words so that the model understands the context. In this video we are going to work through an example. As I mentioned before, the Transformer architecture and the attention mechanism, with all the mathematics involved, were introduced in the landmark paper called "Attention Is All You Need".
Now, this is a series of three videos. In the first video I showed you what attention mechanisms are at a high level; in this video I'll do it with the math; and in the third video, coming up, I'll put it all together and show you how a Transformer model works. So in this particular video you will learn about the concept of similarity between words and pieces of text. One way to measure it is with the dot product and another is with cosine similarity, so we will learn both, and then you will learn what the keys, queries, and values matrices are, how they act as linear transformations, and how they play a role in the attention mechanism. Let's do a quick review of the first video. First we have embeddings. An embedding is a way of putting words, or longer pieces of text, into a space, in this case the plane, but in reality a high-dimensional space, in such a way that words that are similar are sent to points that are close together. For example, here are some fruits: a strawberry, an orange, a banana, and a cherry, and they are all in the top corner of the image because they are similar words, so they are sent to similar points. Then over here we have a lot of brands: Microsoft, Android, and also a laptop and a phone, so that is the technology corner. The question from the previous video is: where would you put the word apple? That's complicated, because it is both a technology brand and also a fruit, so we wouldn't know where to place it right away.
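Just to make the picture concrete, here is a tiny sketch of what such a 2D embedding could look like in code. The coordinates are made up for illustration, except that orange, cherry, phone, and apple use the same toy coordinates we will use later in this video.

```python
# Toy 2D "embedding": each word is mapped to a point in the plane.
# Most coordinates are invented for illustration; real embeddings are
# learned and have hundreds or thousands of dimensions.
embedding = {
    "strawberry": (1.0, 4.5),   # fruit corner
    "cherry":     (1.0, 4.0),
    "banana":     (0.5, 4.2),
    "orange":     (0.0, 3.0),
    "microsoft":  (4.5, 0.5),   # technology corner
    "android":    (4.0, 1.0),
    "phone":      (3.0, 0.0),
    "apple":      (2.0, 2.0),   # ambiguous: sits between the two groups
}
```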
Let's take a closer look at this, with the orange at the top right and the phone at the bottom left. Where would you put the apple? You need to look at the context. If you have a sentence like "please buy an apple and an orange," then you know you're talking about the fruit. If you have a sentence like "Apple introduced a new phone," then you know you're talking about the technology brand. So this word must be given context, and the way it is given context is by the neighboring words. In the first sentence, the word orange is the one that helps us, so what we do is look at where orange is, move apple in that direction, and then use those new coordinates instead of the old ones for apple. Now apple is closer to the fruits, so it knows more about its context, given the word orange in the sentence. For the second sentence, the word that gives us the clue that we are talking about a technology brand is the word phone, so what we do is move apple towards the word phone and use those coordinates in the embedding, so that the apple in the second sentence knows it's more of a technology word, because it's closer to the word phone.
Now, another thing we saw in the previous video is that it's not only the word orange that is going to pull apple; all the other words are going to pull apple as well. How does this happen? It happens with gravity, or actually something very similar to gravity. Close words like apple and orange have a strong gravitational attraction, so they move towards each other. The other words, on the other hand, do not have a strong gravitational attraction because they are very far away. Well, they are not actually far away in space, but they are different, so for now we can think of it as a distance, and I'll tell you exactly what I'm talking about in a minute.
For now, we can think of those words as being far away, and as I said, words that are close get pulled together strongly, and words that are far away also get pulled, but not by much. So after one gravitational step, the words apple and orange are much closer, and the other words in the sentence, well, they may get a little closer, but not that much. What happens is that the context pulls. If I have been talking about fruits for a while, and I said banana, strawberry, lemon, blueberry, and orange, and then I say the word apple, you would imagine that I'm talking about a fruit. What happens in the embedding space is that we have a galaxy of fruit words somewhere, and they form a strong group, so when the word apple appears, it gets dragged towards this galaxy, and therefore apple now knows that it is a fruit and not a technology brand.
Now, remember I told you that those words are very far away, but that's not really true; what we actually need is the concept of similarity. So what is similarity? Humans have an intuition that some words are similar and others are different, and that is exactly what similarity measures. We saw before that cherry and orange are similar to each other and different from phone, and we had the impression of a measure of distance: cherry and orange are close, so they have a small distance between them, and cherry and phone are far away, so they have a large distance. But as I mentioned, what we really want is the opposite: a measure of similarity that is high when the words are similar and low when the words are different. So let me show you three ways to measure similarity, which in the end are actually very similar to each other. The first one is called the dot product. Imagine you have these three words: cherry, orange, and phone. As we saw before, the axes of the embedding actually mean something; it could be something tangible to humans, or maybe something only the computer knows and we don't, but let's say the horizontal axis measures technology and the vertical axis measures fruitiness. The cherry and the orange have high fruitiness and low technology, which is why they are located at the top left, and the phone has high technology and low fruitiness, which is why it is located at the bottom right. Now we need a measure that is high for cherry and orange. We look at their coordinates, (1, 4) for cherry and (0, 3) for orange, and remember that one coordinate is the amount of technology and the other is the amount of fruitiness. If these words are similar, we would imagine that they have similar amounts of technology and similar amounts of fruitiness. In particular, they are both low in technology, so if we multiply those two numbers, 1 times 0, we get a low number. But they both have high fruitiness, so if we multiply those two numbers, 4 times 3, we get a high number, and when we add them together, the product of the technology values plus the product of the fruitiness values, we get a high number, which is 12. Now let's do the same with cherry and phone; the similarity should be a small number. We look at (1, 4) and (3, 0) and take the dot product: it is 1 times 3, the product of the technology values, plus 4 times 0, the product of the fruitiness values, which is a small number, 3. The reason is that if one of the words has low technology, the other has high technology, and if one of them has low fruitiness, the other has high fruitiness, so we are not going to get a big number by multiplying and adding them, and the extreme case is the orange and the phone.
The coordinates of orange and phone are (0, 3) and (3, 0), so when we multiply 0 by 3 we get zero, plus 3 times 0, which is also zero, so the dot product is zero. Notice that these two vectors are actually perpendicular when drawn from the origin, and when two vectors are perpendicular they will always have a dot product of zero. So the dot product is our first measure of similarity: it is high when the words are similar, that is, close in the embedding, and low when the words are far apart, and note that it can also be negative. The second measure of similarity is called cosine similarity, and it looks very different from the dot product, but they are actually very similar, as we will see. Here we use an angle, so let's calculate the cosine similarity between orange and cherry. We look at the angle the two vectors make when drawn from the origin. This angle is 14 degrees; if you calculate it, it is the arctangent of one quarter. The cosine of that angle is 0.97, which means the cosine similarity between cherry and orange is 0.97. Now let's calculate the one between cherry and phone. That angle is 76 degrees, because it is the arctangent of 4 divided by 1, and its cosine is 0.24, so the similarity between cherry and phone is 0.24, which is much less than 0.97. Finally, guess what the similarity between the orange and the phone will be: it is the cosine of this 90-degree angle, which is again zero. Since this is a cosine, it is between -1 and 1, so values near one are for words that are very similar, and zero and negative values are for words that are very different. Now, I told you that the dot product and cosine similarity are very similar, but they don't look that similar, so why are they?
Because they are the same if the vectors have length one. More specifically, if I draw a unit circle around the origin, take each point, draw the line from the origin to that point, and put the word where the line meets the circle, that means I rescale everything so that all vectors have length one, and then the cosine similarity and the dot product are exactly the same. So if all my vectors have norm 1, cosine similarity and the dot product coincide. At the end of the day, what we are saying is that the dot product and the cosine similarity are equal up to a scalar, and that scalar is the product of the lengths of the two vectors: if I take the dot product and divide it by the product of the lengths of the vectors, I get the cosine similarity.
Now there's a third one, called the scaled dot product, which, as you can imagine, is another multiple of the dot product, and it is actually the one used in attention, so let me quickly show you what it is. It is the dot product as before, so here we get 12, except now we divide by the square root of the length of the vectors, and the length of these vectors is 2 because they have two coordinates, so we get 8.49 for the first pair. For the second pair we had a 3, and 3 divided by the square root of 2 is 2.12, and for the third pair we had a zero, and zero divided by the square root of 2 is still 0.
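To make these three measures concrete, here is a minimal Python sketch using the same toy coordinates from before (cherry at (1, 4), orange at (0, 3), phone at (3, 0)); this just mirrors the arithmetic above, it is not the code of any particular library.

```python
import math

def dot(u, v):
    """Dot product: sum of coordinate-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths (norms)."""
    norm = lambda w: math.sqrt(sum(wi * wi for wi in w))
    return dot(u, v) / (norm(u) * norm(v))

def scaled_dot(u, v):
    """Dot product divided by the square root of the dimension (used in attention)."""
    return dot(u, v) / math.sqrt(len(u))

cherry, orange, phone = (1, 4), (0, 3), (3, 0)

print(dot(cherry, orange))                          # 12
print(dot(cherry, phone))                           # 3
print(dot(orange, phone))                           # 0
print(round(cosine_similarity(cherry, orange), 2))  # 0.97
print(round(cosine_similarity(cherry, phone), 2))   # 0.24
print(round(scaled_dot(cherry, orange), 2))         # 8.49
```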
Now, the question is: why are we dividing by this square root of 2? It is because when you have very long vectors, for example with a thousand coordinates, you get really large dot products, and you don't want that; to manage these numbers you want them to stay small, so you divide by the square root of the length of the vectors. As I mentioned before, the one used in attention is the scaled dot product, but for this example, just to get nice numbers, we'll use cosine similarity; at the end of the day, remember, everything just scales by the same factor. So let's look at an example. We have the sentences "an apple and an orange" and "an apple phone", and we are going to calculate some similarities. First let's set some coordinates: say the orange is at position (0, 3), the phone is at position (3, 0), and this ambiguous apple is at position (2, 2). Now, embeddings don't have only two dimensions, they have many, so to make it a bit more realistic let's say we have three dimensions. There is another axis here, but all the words so far are at zero on that axis, so they sit on the plane next to the wall. However, the sentences have more words, the words "and" and "an", so let's say that "and" and "an" are over here, at coordinates (0, 0, 2) and (0, 0, 3). Now let's calculate the similarities between all of these words, which gives us this table here.
The first easy thing to notice is that the cosine similarity between each word and itself is one. Why? Because the angle between a word and itself is zero, and the cosine of zero is one, so the whole diagonal is ones. Now let's calculate the similarity between orange and phone: we already saw that this is zero, because the angle is 90 degrees. Next, the angle between orange and apple is 45 degrees, and the same between apple and phone, and the cosine of 45 degrees is 0.71. Finally, orange, apple, and phone each make a 90-degree angle with "and" and with "an", so all of those entries are zero, and the angle between "and" and "an" is zero, so the cosine is one.
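As a sanity check, here is a small sketch that rebuilds that table with cosine similarity, using the toy three-dimensional coordinates from above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

words = {
    "orange": (0, 3, 0),
    "apple":  (2, 2, 0),
    "phone":  (3, 0, 0),
    "and":    (0, 0, 2),
    "an":     (0, 0, 3),
}

# Print the full table of pairwise cosine similarities, one row per word.
for w1, v1 in words.items():
    row = [round(cosine(v1, v2), 2) for v2 in words.values()]
    print(f"{w1:>7}: {row}")
# orange: [1.0, 0.71, 0.0, 0.0, 0.0]
#  apple: [0.71, 1.0, 0.71, 0.0, 0.0]
# ...
```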
This is our complete table of similarities between the words, and we are going to use this similarity table to move the words; that is the attention step. Let's look at this table, but only for the words in the sentence "an apple and an orange": we have the words orange, apple, "and", and "an", and we're going to move them, that is, take them and change their coordinates slightly. Each of these words will be sent to a combination of itself and all the other words, and the combination is given by the rows of this table. More specifically, let's look at orange.
Orange goes to 1 times orange, which is this coordinate here, plus 0.71 times apple, which is this coordinate here, plus 0 times "and" plus 0 times "an", which don't contribute anything. Now let's see where apple goes. Apple goes to 0.71 times orange plus 1 times apple plus 0 times "and" plus 0 times "an". Now let's look at "and": "and" goes to 0 times orange plus 0 times apple plus 1 times "and" plus 1 times "an", and the same thing happens with the word "an": it goes to 0 times orange plus 0 times apple plus 1 times "and" plus 1 times "an". So basically what we did was take each of the words and send it to a combination of the other words. Now orange has a little bit of apple, apple has a little bit of orange, and so on; we're just moving the words around, and I'll show you graphically what this means in a moment. Let's also do it for the other sentence, "an apple phone"; this one I'll do a little faster. Phone goes to 1 times phone plus 0.71 times apple plus 0 times "an", apple goes to 0.71 times phone plus 1 times apple plus 0 times "an", and "an" goes to 1 times "an". However, there are some technicalities I need to tell you about. First of all, let's look at the word orange: it goes to 1 times orange plus 0.71 times apple, but these can become big numbers.
Imagine doing this many times: a word could end up being sent to 500 times orange plus 400 times apple. I don't want these big numbers; I want to scale everything down, so in particular I want these coefficients to always add up to one, so that no matter how many transformations I do, I end up with a percentage of orange, a percentage of apple, and maybe percentages of other words, but they don't explode. To make the coefficients add up to one, I divide by their sum, 1 plus 0.71, so I get 0.58 times orange plus 0.42 times apple. That process is called normalization. However, there is a small problem; can you spot it?
Here is what can happen. Let's say I have orange going to 1 times orange minus 1 times motorcycle, because remember that cosine similarity can be a negative number. If I want these coefficients to sum to 1, I would divide them by their sum, which is 1 minus 1, that is, zero, and dividing by zero is a terrible thing; never, ever divide by zero. So how do I solve this problem? I would like these coefficients to always be positive, so I need a way to take these coefficients and turn them into something positive. However, I still want to respect the order.
One is much larger than negative one, so I want the coefficient that 1 becomes to still be much larger than the coefficient that negative 1 becomes. So what is the solution? A common solution is, instead of taking each coefficient, to take e raised to that coefficient. So I have e to the 1 times orange plus e to the 0.71 times apple, divided by e to the 1 plus e to the 0.71. The numbers change slightly; now they are 0.57 and 0.43. And what happens in the problematic case below? Well, the 1 becomes e to the 1, the minus 1 becomes e to the minus 1, and now when I add them, the denominator becomes e to the 1 plus e to the minus 1, and the result is 0.88 times orange plus 0.12 times motorcycle. So we effectively turn the numbers into positive ones while respecting their order, and we do this step so that the coefficients always add up to one.
This step is called softmax, and it is a very popular function in machine learning. So now we can go back to the tables that created these new words, replace the numbers by their softmax values, and get the real coefficients; this is what will tell us how the words move. Actually, before I show you this geometrically, I have to admit I've been cheating a little, because softmax does not send a zero to zero. In fact, these four numbers get sent to e to the 1, e to the 0.71, e to the 0, and e to the 0, and e to the 0 is one, so this combination of words, when you normalize it, actually becomes 0.4 times orange plus 0.3 times apple plus 0.15 times "and" plus 0.15 times "an". So it's not that easy to get rid of those "and"s and "an"s, but as you can imagine, in real life they will have coefficients so small that they are practically negligible; at the end of the day, though, you have to consider all the numbers when you apply softmax.
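Here is a minimal sketch of that softmax step applied to one row of the similarity table, just to mirror the arithmetic above.

```python
import math

def softmax(scores):
    """Exponentiate each score, then normalize so the results sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The row of similarities for "apple" in "an apple and an orange":
# (orange, apple, and, an)
row = [0.71, 1.0, 0.0, 0.0]
print([round(w, 2) for w in softmax(row)])          # [0.3, 0.4, 0.15, 0.15]

# Softmax also handles negative similarities gracefully:
print([round(w, 2) for w in softmax([1.0, -1.0])])  # [0.88, 0.12]
```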
But let's go back to pretending that "and" and "an" are not important, in other words, let's go back to the earlier equations. Apple goes to 0.43 times orange plus 0.57 times apple in the first sentence, and apple goes to 0.43 times phone plus 0.57 times apple in the second one. So if we forget about "and" and "an", we can go back to the plane where the three words orange, apple, and phone sit nicely. Looking at the equations again, apple going to 0.43 orange plus 0.57 apple really means we are taking 43% of apple and turning it into orange. Geometrically, this means we take the line from apple to orange and move the word apple 43% of the way along it, which gives us the new coordinates (1.14, 2.43). For the second sentence we do the same: we take 43% of apple and turn it into the word phone, which means we draw the line from apple to phone and place the new apple 43% of the way along it, at coordinates (2.43, 1.14).
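Here is a quick sketch of that movement step, using the softmax weights from above and the same toy coordinates.

```python
def move(word_vec, context_vecs, weights):
    """New position = weighted average of the word itself and its context words."""
    dims = len(word_vec)
    vectors = [word_vec] + context_vecs
    return tuple(
        sum(w * v[d] for w, v in zip(weights, vectors))
        for d in range(dims)
    )

apple, orange, phone = (2, 2), (0, 3), (3, 0)

# Sentence "an apple and an orange": apple keeps 57% of itself, takes 43% of orange.
new_apple_fruit = move(apple, [orange], [0.57, 0.43])
print(tuple(round(c, 2) for c in new_apple_fruit))   # (1.14, 2.43)

# Sentence "an apple phone": apple takes 43% of phone instead.
new_apple_tech = move(apple, [phone], [0.57, 0.43])
print(tuple(round(c, 2) for c in new_apple_tech))    # (2.43, 1.14)
```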
So what does this mean? It means that when we talk about the first sentence, we're not going to use the coordinates (2, 2) for apple, we're going to use the coordinates (1.14, 2.43), and in the second sentence we're not going to use the coordinates (2, 2), we're going to use the coordinates (2.43, 1.14). So now we have better coordinates, because these new coordinates are closer to the orange or the phone coordinates depending on which sentence the word apple appears in, and therefore we have a better version of the word apple. Now, this is not much, but imagine doing it many times. In a Transformer the attention step is applied many times, so if you apply it many times, at the end the words will end up much closer to where the context dictates in that piece of text, and in a nutshell that's what attention does. So now we're ready to learn about the keys, queries, and values matrices. If you look at the original diagrams, scaled dot-product attention on the left and multi-head attention on the right, they contain these K, Q, and V. Those are the keys, queries, and values matrices. In fact, let's learn about keys and queries first, and we'll learn about values later in this video, so let me show you how I like to think about the keys and queries matrices. Remember from earlier in this video that when you do the attention step, you take the embedding and then you move this ambiguous apple towards the phone or towards the orange, depending on the context of the sentence.
Now, in the previous video we learned what a linear transformation is: it's basically a matrix that you multiply all the vectors by, and you can get something like this, another embedding, or maybe something like this. A good way to imagine a linear transformation is that it sends the unit square to some parallelogram, and the rest of the plane follows, because the square tiles the plane, so you just keep tiling the plane with copies of that parallelogram and you get a transformation from the plane to the plane. So these two examples here are linear transformations of the original embedding.
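For concreteness, here is a tiny sketch of what "applying a linear transformation to an embedding" means: every word vector gets multiplied by the same matrix. The matrix values here are invented purely for illustration.

```python
def transform(matrix, vec):
    """Multiply a 2x2 matrix by a 2D vector."""
    (a, b), (c, d) = matrix
    x, y = vec
    return (a * x + b * y, c * x + d * y)

# An invented 2x2 matrix: stretches and shears the plane.
M = ((2, 1),
     (0, 1))

embedding = {"orange": (0, 3), "apple": (2, 2), "phone": (3, 0)}
new_embedding = {w: transform(M, v) for w, v in embedding.items()}
print(new_embedding)
# {'orange': (3, 3), 'apple': (6, 2), 'phone': (6, 0)}
```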
Now let me ask you a question: of these three embeddings, which one is the best for applying the attention step, and which one is the worst? Feel free to pause the video and think about it, and then I'll tell you the answer. Well, the first one is so-so, because when you apply attention it separates the fruit apple from the technology apple, but not by much. The second one is horrible, because you apply the attention step and it doesn't really separate the two words, so it doesn't add much information. And the third one is great, because it really spaces out the phone and the orange and therefore separates the technology apple from the fruit apple very well, so this is the best one. The point is that the keys and queries matrices will help us find really good embeddings where we can apply attention and get a lot of useful information.
Now, how do they do this? Through linear transformations, but let me be more specific. Remember that attention was carried out by calculating similarities, so let's recall how we did it. Say you have the vector for orange and the vector for phone; in this example they have three coordinates, but they could have as many as we want, and we want to find their similarity. The similarity is the dot product, or actually the scaled dot product or the cosine similarity, but at the end of the day they are all the same up to a scalar, so let's treat them all in the same way. The dot product can be seen as the product of the first vector times the transpose of the second, a matrix product, and as I said before, if we don't care much about the scale, we can think of it as the cosine similarity in that particular embedding. So how do we get a new embedding? This is where the keys and queries matrices come in. What they do is modify the embeddings: instead of taking the orange vector, we take the orange vector multiplied by the keys matrix, and instead of taking the phone vector, we take the phone vector multiplied by the queries matrix, and we get new embeddings. When we want to calculate the similarity, it is the same as before, the product of the first times the transpose of the second, which is the orange vector times K, times the transpose of the phone vector times Q, and K times Q-transpose is a matrix that defines a linear transformation, and that is the linear transformation that takes this embedding to this one over here. So the keys and queries matrices work together to create a linear transformation that improves our embedding so that we can apply attention in it. What we're doing is replacing the similarity in one embedding by the similarity in a different embedding, and we'll make sure the new one is better; in fact, we're going to compute a lot of them and find the best ones, but that will come a little later. For now, think of the keys and queries matrices as a way to transform our embedding into one that is more suitable for this attention problem.
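Here is a small NumPy sketch of that idea; the K and Q matrices are random placeholders, since in a real model they are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
orange = np.array([0.0, 3.0, 0.0])
phone  = np.array([3.0, 0.0, 0.0])

# In a trained model K and Q are learned; here they are random stand-ins.
K = rng.normal(size=(d, d))
Q = rng.normal(size=(d, d))

# Similarity in the original embedding: a plain dot product.
plain_similarity = orange @ phone

# Similarity in the transformed embedding: (orange K) . (phone Q),
# which is the same as orange @ (K @ Q.T) @ phone,
# so K @ Q.T acts as a single linear transformation applied before comparing.
new_similarity = (orange @ K) @ (phone @ Q)

print(plain_similarity, new_similarity)
```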
And now that you have learned what the keys and queries matrices are, let me show you what the values matrix is. Remember that the keys and queries matrices take the embedding to one that is better for calculating similarities. However, that embedding is not where you want to move the words; you only want it for calculating the similarities. Let's say there is an ideal embedding for moving words around, and it's the one over here on the right. So what do we do? Using the similarities we found in the embedding on the left, we move the words in the embedding on the right. Why? Because the embedding on the left is optimized for finding similarities, while the embedding on the right is optimized for finding the next word in a sentence. Why is that? Because that is what a Transformer does. We're going to learn this in the next video, but a Transformer finds the next word in a sentence, and keeps finding the next word until it generates long chunks of text, so if you want to move words around, you want an embedding that is optimized for finding the next word. Remember that the embedding on the left is obtained using the keys and queries matrices, and the embedding on the right is obtained using the values matrix. And what does the values matrix do? It is the one that takes the left embedding and multiplies each vector to get the right embedding; when you multiply the left embedding by the matrix V, you get another transformation, because you can compose these linear transformations, and you get the linear transformation that corresponds to the embedding on the right. Now, why is the left embedding the best for finding similarities? Well, it is the one that knows the features of the words; for example, it would be able to capture the color of a fruit, the size, the fruitiness, the flavor, the amount of technology in the phone, and so on. It is the one that captures the characteristics of the words. The embedding for finding the next word, on the other hand, is the one that knows when two words could appear in the same context; for example, if the sentence is "I want to buy a ___", the next word could be car, it could be apple, it could be phone, and in the right embedding all those words are close, because the right embedding is good for finding the next word in a sentence. So remember that the keys and queries matrices capture the granular, high-level and low-level features of the words, whereas the embedding on the right is optimized for finding the next word in a sentence, and that's how the keys, queries, and values matrices give you the best embeddings in which to apply attention. Now, just to show you a little bit of the math that happens with the values matrix.
Imagine these are the similarities you found after the softmax function; that means apple goes to 0.3 times orange plus 0.4 times apple plus 0.15 times "and" plus 0.15 times "an", and that is given by the second row of the table. When you multiply this matrix by the values matrix, you get some other embedding, and that embedding could have a completely different length, because the values matrix doesn't need to be square, it can be a rectangle. The second row then tells us that instead, apple should go to V21 times orange plus V22 times apple plus V23 times "and" plus V24 times "an"; that's how the values matrix transforms the first embedding into another one. Well, that was a lot, so let's do a little summary. On the left you have the scaled dot-product attention diagram and the formula, so let's break it down step by step. First you have this step here where you multiply Q by the transpose of K; that is the dot product, and you divide by the square root of d_k, where d_k is the length of each of the vectors. Remember, this is called the scaled dot product, so what we're doing here is finding the similarities between the words. I've been illustrating the similarities with angles, using cosine similarity, but you know we could use the scaled dot product instead. Now that we have found the similarities, we move on to the step here with the softmax, which is where we figure out where to move the words; in particular, the technology apple moves towards the phone and the fruit apple moves towards the orange. But we are not going to make those moves in this embedding, because this embedding is not optimal for that; this embedding is optimal for finding similarities, so we will use the values matrix to convert this embedding into a better one.
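Putting the whole thing together, the formula on the screen is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, and here is a compact NumPy sketch of a single attention head; the weight matrices are random placeholders, since in a real model they are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention on an embedding X (one row per word)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarities between all pairs of words
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # move each word in the "values" embedding

rng = np.random.default_rng(0)
d_model, d_k = 3, 2
X = np.array([[0.0, 3.0, 0.0],   # orange
              [2.0, 2.0, 0.0],   # apple
              [3.0, 0.0, 0.0]])  # phone
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)    # (3, 2): one new vector per word
```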
V is acting as a linear transformation that takes the left embedding to the embedding on the right, and the embedding on the right is where we move the words, because that embedding is optimized for the job of the Transformer, which is to find the next word in a sentence. So that is self-attention. Now, what is multi-head attention? Well, it's very similar, except you use many heads, and by many heads I mean many keys, queries, and values matrices. Here we show three, but you could use eight, you could use twelve, you could use many more, and obviously the more you use, the more computing power you need, but also the more likely you are to find some really good ones. These three K and Q matrices, as we saw before, form three embeddings where you can apply attention; K and Q are what help us find the embeddings where the similarities between words are computed. We also have a bunch of values matrices, as many as the keys and queries matrices, it's always the same number, and just like before, these values matrices transform the embeddings where we found the similarities into embeddings where we can move the words. Now here is the magic step: how do we know which ones are good and which ones are bad? Well, right now we don't; first we concatenate them. What does concatenate mean? If I have a table with two columns, and another table with two columns, and another table with two columns, when I concatenate them I get a table with six columns. Geometrically, that means that if I have, say, a two-dimensional embedding, and another, and another, and I concatenate them, I get a six-dimensional embedding. Now, I can't draw in six dimensions, but imagine this here is a really big six-dimensional embedding, something with six axes. In real life, if you have a lot of these embeddings, you end up with a very high-dimensional one, which is not optimal, and that's why we have this linear step here. The linear step is a rectangular matrix that transforms this into a lower-dimensional embedding that we can handle, but it does more than that: this matrix also learns which embeddings are good and which are bad. For example, the best embedding for finding the similarities was the third one, so that one gets scaled up, and the worst embedding was the middle one, so that one gets scaled down. So this linear step actually does a lot: if it learns which matrices are better than others, and scales the good ones up and the bad ones down, then we end up with a really good embedding, and we end up applying attention in a pretty optimal embedding, which is exactly what we want.
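Here is a rough sketch of multi-head attention in the same style as the single-head sketch above (the attention function is repeated so this block stands on its own); all the weight matrices, including the final linear projection Wo, are random placeholders that a real model would learn.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, Wo):
    """Run each head, concatenate the results, then apply the final linear step Wo."""
    outputs = [attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concatenated = np.concatenate(outputs, axis=-1)  # e.g. three 2-column tables -> 6 columns
    return concatenated @ Wo                         # back down to a manageable dimension

rng = np.random.default_rng(1)
d_model, d_k, n_heads = 3, 2, 3
X = np.array([[0.0, 3.0, 0.0],   # orange
              [2.0, 2.0, 0.0],   # apple
              [3.0, 0.0, 0.0]])  # phone
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_k, d_model))       # also learns how much to weight each head
print(multi_head_attention(X, heads, Wo).shape)      # (3, 3): one vector per word again
```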
Now, I've done a bit of magic here, because I haven't actually told you how to find these keys, queries, and values matrices. I mean, they seem to work very well, but finding them is probably not easy; that's something we'll look at more in the next video. The idea is that the keys, queries, and values matrices are trained together with the Transformer model. Here is a Transformer model, and you can see that multi-head attention appears several times. I like to simplify this diagram into this one here, where you have tokenization, embedding, positional encoding, and then a feedforward part and an attention part that repeat several times, and each of the blocks has an attention block. In other words, imagine training a huge neural network, and that neural network contains a bunch of keys, queries, and values matrices that are trained as the network learns to guess the next word. But now I'm getting into the next video; this is what we will learn in the third video of this series. So again, this was the second video, on the math of attention mechanisms; the first one had the high-level idea of attention, and the third one will be on Transformer models, so stay tuned, and when it comes out I will put a link in the comments. That's all, friends. Congratulations on making it to the end; this was a somewhat complicated video, but I hope the pictorial examples were helpful.
Now it's time for some thanks. I wouldn't have been able to make this video if it weren't for my friend and colleague Joel, who is a genius and knows a lot about attention and Transformers, and who actually helped me work through these examples and form these images, so thank you, Joel. Some more shout-outs: Jay Alammar was also a great help in understanding Transformers and attention; we had long conversations where he explained this to me several times. My friend Omar Flores was also very helpful; I actually have a podcast episode with him where I ask him questions about Transformers for about an hour, and I learned a lot from that conversation.
It's in Spanish, but if you speak Spanish, check out the link in the comments; it is also on my Spanish YouTube channel, serrano.academy. And if you like this material, definitely check out llm.university. It is a course I have been building at Cohere with my very knowledgeable colleagues Meor Amer and Jay Alammar, the same Jay from before. It is a very complete course, taught in very simple language, and it covers all the things in this video, including embeddings, similarity, attention, and Transformers. It also has many labs where you can do semantic search, prompt engineering, and many other topics, so it is very hands-on, and it also teaches you how to deploy models; basically it is a zero-to-100 course in LLMs, and I recommend you check out llm.university. Finally, if you want to follow me, subscribe to the channel, give the video a like, or leave a comment.
I love reading the comments. The channel is serrano.academy; you can also tweet at me or visit my page, also at serrano.academy, where I have blog posts and lots of other things. I also have a book called Grokking Machine Learning, in which I explain machine learning in this same simple and pictorial way, with labs that are on GitHub, so check it out; there is a 40% discount code, and the link and the information are in the comments. Thank you very much for your attention, and see you in the next video.
