
MIT 6.S191 (2023): Recurrent Neural Networks, Transformers, and Attention

We try to extract information from this input data. The key idea is being able to identify and pay attention to what is important in a potentially sequential stream of information, and this is the notion of attention, or self-attention, which is an extremely powerful tool. I can't overstate how central it is to modern deep learning and AI; I can't emphasize enough how powerful this concept is. Attention is the foundational mechanism of the Transformer architecture, which many of you will have heard of, and the notion of a Transformer can often seem daunting, because it is sometimes presented through really complex diagrams or deployed in complex applications, and you may wonder how to even start making sense of it. At its core, though, attention, the key operation, is an intuitive idea, and in the last part of this lecture we will break it down step by step to see why it is so powerful and how we can use it as part of a larger neural network like a Transformer. Specifically, we are going to focus on this idea of self-attention: paying attention to the most important parts of an input example. So let's consider an image.
I think it's more intuitive to consider an image first. This is an image of Iron Man, and if our goal is to extract the important information from this image, one thing we could do is naively scan it pixel by pixel. Our brains may internally be doing some kind of computation like this, but you and I can simply look at this picture and pay attention to the important parts: we can see that it's Iron Man coming towards us in the image, and then we can focus a little more and ask, what are the details about Iron Man that might be important?

The key is that your brain is identifying which parts to attend to and then extracting the features that deserve the most attention. The first part of this problem is really the most interesting and challenging, and it's very similar to the concept of search; effectively, that's what search does: it takes a broader pool of information and tries to identify and extract the important parts. So how does search work? Suppose you're thinking, I'm in this class, how can I learn more about neural networks today? One thing you can do, besides coming here and joining us, is to go on the Internet, where all the videos are available, and try to find something that matches, performing a search operation. You have a giant database like YouTube and you want to find a video, so you enter your query, deep learning, and out come some possible results. For each video in the database, there is key information associated with it, let's say the title. To do the search, the task is to find the overlap between your query and each of these titles, the keys in the database. What we want to compute is a similarity or relevance metric between the query and these keys: how similar is each key to our desired query? We can do this step by step. The first option, a video about elegant giant sea turtles: not so similar to our query on deep learning. The second option, Introduction to Deep Learning, the first introductory lecture of this class: yes, very relevant. The third option, a video about the late, great Kobe Bryant: not so relevant. The key operation here is the similarity computation that brings together the query and the key. The final step, now that we have identified which key is relevant, is to extract the relevant information, what we want to attend to, and that is the video itself; we call it the value. Because search is implemented well, we have successfully identified the relevant video on deep learning that you will want to pay attention to. It is this idea, this intuition of taking a query, computing similarity against keys, and extracting the related values, that forms the basis of self-attention and how it works in neural networks like Transformers.
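To make that intuition concrete, here is a minimal sketch of the search analogy in Python, using made-up video titles and a crude word-overlap similarity as a stand-in for the learned similarity that attention will use; none of this comes from the lecture's own code.

```python
# Minimal sketch of the search intuition (toy data, not a real YouTube API):
# compare a query against the keys (video titles) with a simple similarity
# score, then return the associated value (the video) with the best match.
query = "deep learning"

database = [
    {"key": "Elegant giant sea turtles",     "value": "video_1"},
    {"key": "Introduction to Deep Learning", "value": "video_2"},
    {"key": "Kobe Bryant career highlights", "value": "video_3"},
]

def similarity(query, key):
    # crude relevance metric: count overlapping words (a stand-in for the
    # learned similarity used by attention)
    q_words = set(query.lower().split())
    k_words = set(key.lower().split())
    return len(q_words & k_words)

best = max(database, key=lambda entry: similarity(query, entry["key"]))
print(best["value"])   # video_2, the deep learning lecture
```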
So, to get into this concretely, let's now go back to our language example, with the text sentence where we aim to identify and attend to the features of the input that are relevant to the semantic meaning of the sentence. Now, in the first step, we have a sequence, we have an order, but we have eliminated recurrence: we are feeding in all the time steps at once. We still need a way to encode and capture this information about order, this positional dependence. The way this is done is with the idea of a positional encoding, which captures some inherent order information present in the sequence. I'll only touch on this very briefly, but the idea is related to the embedding idea I introduced earlier: a neural network layer is used to encode positional information that captures the relative ordering relationships within this text.
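As a rough illustration, here is a minimal sketch of a learned positional embedding, with toy sizes and random matrices standing in for weights that would actually be learned during training; it only shows how a position-dependent vector gets added to each token embedding.

```python
import numpy as np

# Minimal sketch (not the course's code): a learned positional embedding
# table added to token embeddings, so each row encodes both the token
# identity and where it sits in the sequence.
np.random.seed(0)

seq_len, vocab_size, d_model = 6, 100, 8          # toy sizes (assumptions)
token_ids = np.array([12, 47, 3, 88, 5, 21])      # a toy tokenized sentence

token_embedding = np.random.randn(vocab_size, d_model) * 0.1    # learned in practice
positional_embedding = np.random.randn(seq_len, d_model) * 0.1  # learned in practice

x = token_embedding[token_ids] + positional_embedding[np.arange(seq_len)]
print(x.shape)  # (6, 8): one positionally encoded vector per token
```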
That is the high-level concept: we can still process these time steps all at once, there is no notion of the passage of time, the data is handled as a single block, but we have still learned an encoding that captures positional order information. Our next step is to take this encoding and figure out what exactly to attend to, just like the search operation I introduced with the YouTube example: extract a query, extract a key, extract a value, and relate them to each other. We use neural network layers to do exactly this. Given this positional encoding, here is what attention does.
It applies a neural network layer that transforms the positional embedding, first generating the query. We do this again using a separate neural network layer, a different set of weights, a different set of parameters, which transforms that positional embedding in a different way, generating a second output, the key. Finally, the operation is repeated with a third layer, a third set of weights, generating the value. Now, with these three in hand, the query, the key, and the value, we can compare them with each other to figure out where, within the input itself, the network should attend, what is important, and that is the key idea behind this similarity metric, or what can be thought of as an attention score.
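A minimal sketch of this step, assuming toy dimensions and random matrices in place of learned weights: three separate linear layers map the same positionally encoded input to queries, keys, and values.

```python
import numpy as np

# Three different weight matrices transform the same positional embedding x
# into query, key, and value representations.
np.random.seed(1)

seq_len, d_model = 6, 8
x = np.random.randn(seq_len, d_model)           # positionally encoded input

W_q = np.random.randn(d_model, d_model) * 0.1   # learned by backpropagation in practice
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1

Q = x @ W_q   # queries
K = x @ W_k   # keys
V = x @ W_v   # values
print(Q.shape, K.shape, V.shape)   # (6, 8) each
```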
What we are doing is computing a similarity score between a query and a key, and remember that these query and key values are just arrays of numbers; we can think of them as vectors in space. The query values form some vector, and the key values form some other vector. Mathematically, the way we can compare these two vectors to understand how similar they are is by taking their dot product and scaling it. This captures how similar the vectors are, whether or not they point in the same direction, and it is our similarity metric. If you're familiar with a bit of linear algebra, this is also known as the cosine similarity. The operation works in exactly the same way for matrices: if we apply this dot-product operation to our query and key matrices, we get our similarity metric. This is very, very key for defining our next step.
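Here is a minimal sketch of that similarity computation on two hand-picked toy vectors, using the scaled dot product: a key pointing in a similar direction to the query scores higher than one pointing elsewhere.

```python
import numpy as np

# Dot product of a query vector with a key vector as a similarity score,
# scaled by sqrt(d) as in scaled dot-product attention.
q = np.array([1.0, 0.5, -0.2, 0.3])
k_similar = np.array([0.9, 0.6, -0.1, 0.2])    # points in a similar direction
k_different = np.array([-1.0, 0.2, 0.8, -0.5]) # points elsewhere

d = q.shape[0]
print(q @ k_similar / np.sqrt(d))     # larger score -> more similar
print(q @ k_different / np.sqrt(d))   # smaller (here negative) score
```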
That next step is computing the attention weighting: what the network should actually attend to within this input. This operation gives us a score that defines how the components of the input data relate to each other. For a sentence, once we compute this similarity score metric, we can start thinking about weights that define the relationships between the sequential components of the data. For example, in the text sentence "He tossed the tennis ball to serve," the goal of the scoring is that the words in the sequence that are related to each other should have high attention weights: ball related to tossed, related to tennis. This metric itself is our attention weighting; what we have done is pass that similarity score through a softmax function, and all it does is constrain those values to lie between 0 and 1.
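A minimal sketch of that softmax step, with a random score matrix standing in for the scaled query-key products: each row of the result is a set of attention weights between 0 and 1 that sums to 1.

```python
import numpy as np

# Turn a raw score matrix into attention weights with a row-wise softmax.
def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(2)
seq_len = 6
raw_scores = np.random.randn(seq_len, seq_len)   # e.g. Q @ K.T / sqrt(d)

attention_weights = softmax(raw_scores)
print(attention_weights.sum(axis=-1))   # each row sums to 1
```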
So you can think of these as relative attention weights. Finally, now that we have this metric that captures the notion of similarity and these internal relationships, we can use it to extract the features that deserve high attention, and that is exactly the final step in the self-attention mechanism: we take that attention weighting matrix, multiply it by the value matrix, and we get a transformed version of the initial data as our output, which reflects the features corresponding to high attention.
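A minimal sketch of this final step with toy shapes, using a placeholder weight matrix: the attention weights multiply the value matrix to produce one transformed feature vector per position.

```python
import numpy as np

# Multiply the attention weight matrix by the value matrix to get the
# transformed output, which emphasizes the high-attention features.
np.random.seed(5)
seq_len, d_model = 6, 8
attention_weights = np.full((seq_len, seq_len), 1.0 / seq_len)  # stand-in weights, rows sum to 1
V = np.random.randn(seq_len, d_model)                           # value matrix

output = attention_weights @ V
print(output.shape)   # (6, 8): one transformed feature vector per position
```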
Okay, let's take a breath and recap what we just covered so far. The goal of this idea of self-attention, the backbone of Transformers, is to eliminate recurrence and attend to the most important features in the input data. How is this actually implemented in an architecture? First, we take our input data and compute the positional encodings; neural network layers are applied three times to transform the positional encoding into each of the query, key, and value matrices; we can then compute the self-attention weight score with the scaled dot-product operation we walked through before; and finally we use those weights to extract the features of the input that deserve high attention.
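Pulling those steps together, here is a minimal sketch of a single self-attention head with toy dimensions and random stand-in weights; it is an illustration of the recipe just described, not the course's implementation.

```python
import numpy as np

# A single self-attention head: project to Q, K, V, compute scaled
# dot-product scores, softmax them into weights, and apply them to V.
def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_head(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity between every pair of positions
    weights = softmax(scores)                # attention weights in [0, 1], rows sum to 1
    return weights @ V, weights              # high-attention features, plus the weights

np.random.seed(3)
seq_len, d_model = 6, 8
x = np.random.randn(seq_len, d_model)        # positionally encoded input
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.1 for _ in range(3))

output, weights = self_attention_head(x, W_q, W_k, W_v)
print(output.shape, weights.shape)   # (6, 8) (6, 6)
```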
What's so powerful about this approach of taking the attention weighting and coupling it with the value to extract high-attention features is that this operation, the scheme I'm showing on the right, defines a single self-attention head, and you can link multiple of these self-attention heads together to form larger network architectures. You can think of the different heads as trying to extract different information, different relevant parts of the input, to put together a very rich encoding and representation of the data we are working with. Intuitively, to return to our Iron Man example, what this idea of multiple self-attention heads may mean is that different salient features and different salient information get extracted from the data.
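A minimal sketch of that idea, again with toy sizes: each head has its own weight matrices, and their outputs are concatenated; the final linear projection that real Transformers apply to the concatenation is left out here for brevity.

```python
import numpy as np

# Multiple self-attention heads, each with its own parameters, whose
# outputs are concatenated into a richer representation.
def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

np.random.seed(4)
seq_len, d_model, n_heads, d_head = 6, 8, 2, 4
x = np.random.randn(seq_len, d_model)

heads = []
for _ in range(n_heads):   # each head gets its own weight matrices
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) * 0.1 for _ in range(3))
    heads.append(attention_head(x, W_q, W_k, W_v))

multi_head_output = np.concatenate(heads, axis=-1)
print(multi_head_output.shape)   # (6, 8) = (seq_len, n_heads * d_head)
```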
You may consider Iron Man to be what one attention head picks out, and you may have additional attention heads picking out other relevant parts of the data that we may not have noticed before, for example the building or the spaceship in the background that is chasing Iron Man. This is a key component of many, many powerful architectures that exist today. Again, I can't emphasize enough how powerful this mechanism is, and in fact this core idea of self-attention that you just grasped is the key operation in some of the most powerful neural networks and deep learning models that exist today, ranging from very powerful language models like GPT-3, which are capable of synthesizing natural language in a very human-like way, digesting large amounts of textual information to understand relationships in text, to models being deployed for extremely impactful applications in biology and medicine, such as AlphaFold 2, which uses this notion of self-attention to look at protein sequence data and predict the three-dimensional structure of a protein from sequence information alone, and all the way to computer vision, which will be the topic of our next lecture tomorrow, where the same idea of
attention that was initially developed for sequential data applications has now transformed the field of computer vision, again using this key concept of attending to the important features of an input to build very rich representations of complex, high-dimensional data. Okay, that concludes the lectures for today. I know we've covered a lot of territory in a pretty short period of time, but that's what this boot camp program is all about. I hope that today you got a sense of the fundamentals of neural networks in the lecture with Alexander, and that here we talked about RNNs, how they are suited to sequential data, how we can train them using backpropagation, how we can apply them to different applications, and finally how we can move beyond recurrence to build this idea of self-attention and increasingly powerful models for deep sequential modeling. Okay, I hope you enjoyed it. We have about 45 minutes left for the lab portion and open office hours, where we invite you to ask us and the TAs questions and start working on the labs. The information
for the labs is up there. Thank you very much for your attention.
