
MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention

May 16, 2024
All right, if everyone can take a seat, we can get started. My name is Ava and this is lecture two of 6.S191 — thank you, John, thank you, everyone. Today, in this part of the class, we are going to talk about problems that we call sequence modeling problems. In the first lecture with Alexander, we looked at what deep learning is, what the essential elements of neural networks are, what a feedforward model is, and basically how we train a neural network from scratch using gradient descent.
Now we're going to turn our attention to a class of problems that involve sequential data, or sequential data processing, and we're going to talk about how we can build neural networks that are well suited to tackling these types of problems. We'll do it step by step, starting from intuition and building up our concepts and knowledge from there, picking up right where we left off with perceptrons and feedforward models. To do that, I'd first like to motivate what we mean when we talk about something like sequence modeling or sequential data. Let's start with a super simple example: say we have this image of a ball moving somewhere in 2D space, and your task is to predict where the ball will travel next. If I give you no prior information about the history of the ball, its motion, how it has been moving, then your guess about its next position is probably nothing better than a random guess. However, if in addition to the current position of the ball I give you information about where the ball was in the past, the problem becomes much easier; it is more constrained, and we can get a pretty good prediction of where the ball is most likely to travel next.
I love this example because, while it is a visual image of a ball moving in 2D space, it gets to the heart of what we mean when we talk about sequential data or sequential modeling. The truth is that sequential data is all around us. My voice as I speak to you: the audio waveform is sequential data that can be broken into chunks, sequences of sound waves, and processed as such. Similarly, the language we express and communicate in written form, as text, is modeled very naturally as a sequence of characters, individual letters in the alphabet, or words, or chunks into which we can divide the text, considering these pieces one by one in sequence. Beyond that, sequential data is everywhere, from medical readings like electrocardiograms, to financial markets and stock prices that change and evolve over time, to biological sequences like DNA or protein sequences that represent and encode life, and much more. So it almost goes without saying that this is a very rich and very diverse class of data and problems that we can work with here. When we now think about how to build on this to answer specific questions with neural networks and deep learning, we can go back to the problem Alexander presented in the first lecture, where we had a simple binary classification task: will I pass this class?
There, we have a single input and we're trying to generate a single output, a classification based on it. With sequence modeling, we can now deal with data that comes as sequences: we can have words in sentences, in a large body of text, and we might want to reason about those sequences of words; for example, taking a sentence and asking, is there a positive sentiment associated with this sentence, or is it something different? We can also think about generating sequences from other forms of data. Say we have an image and we want to caption it with language: this can also be thought of as a sequence modeling problem, where we receive a single input and try to produce a sequential output. And finally, we can consider tasks that go sequence in, sequence out: say you want to translate speech or text between two different languages; that is very naturally thought of as a many-to-many, translation-type problem that is ubiquitous in natural language translation frameworks. So here again you see the diversity and richness of the types of problems we can consider when we think in sequences. Now let's get to the heart of it from a modeling perspective, from a neural network perspective: how can we start to build models that can handle these types of problems? This is something that I personally had a hard time understanding.
Initially, when I got into machine learning: how do we take something that maps input to output and build it up to think in sequences and deal with these time-dependent, sequential modeling problems? I think it really helps, again, to start from the fundamentals and develop intuition, which is a constant theme throughout this course, so that's exactly what we're going to do: we'll go step by step and hopefully build up an understanding of the models for this type of problem. Okay, so this is the exact same diagram that Alexander just showed. We defined the perceptron, where we have a set of inputs x1 through xm, and our perceptron, our single neuron, operates on them to produce an output: taking their weighted linear combination, applying a non-linear activation function, and generating the output.
We also saw how we can stack perceptrons on top of each other to create what we call a layer, where we take an input, compute through this layer of neurons, and generate an output as a result. Here, though, we still have no real notion of sequence or time: what I'm showing you is just a static, single input and single output. We can think about collapsing the neurons in this layer into a simpler diagram, where I've simplified those neurons into this green block, and we can think of this input-output mapping as happening at a particular time step, a single time step t, where our neural network is trying to learn a mapping between input and output at that time step. Okay, I've been saying that sequence data is data over time, so what would happen if we took this same model and applied it over and over again to all the individual time steps in a data point?
All I've done here is take the same diagram and rotate it 90 degrees so it's now vertical: we have an input vector of numbers, our neural network computes on it, and we generate an output. Now say we have some sequential data, so we no longer have a single time step; we have multiple individual time steps, starting from x0, the first time step in our sequence. What we could do is take that same model and apply it step by step to the other slices, the other time steps in the sequence. What is a potential problem that could arise from treating our sequential data in this kind of isolated, step-by-step view?
Yes, I heard some comments: there is inherently a dependency in the sequence, but in this diagram it is completely missing. There is no link between time step zero and later time steps; in fact, in this setup, we are only treating the time steps in isolation. But I think we can all appreciate that we want the output at a later step to depend on the inputs and observations we saw earlier. By treating the data this way, we are completely missing the inherent structure in the data and the patterns we're trying to learn. So the key idea is: what if we could build our neural network to explicitly model that relationship between time steps, time step by time step? One idea is simply to take this model and link the computation between time steps, and we can do this mathematically by introducing a variable that we call h. h of t represents this notion of the state of the neural network, and what that means is that the state is actually learned and computed by the neurons in this layer, and then it is passed on and propagated forward step by step, updated iteratively and sequentially. What you can see now, as we start to build out this modeling diagram, is the following.
We can now express a relationship where the output at a time step t depends both on the input at that time step and on the state from the previous time step that just passed. This is a really powerful idea: it's an abstraction we can capture in the neural network, this notion of state that captures something about the sequence, and we iteratively update it as we make observations over time on the data in the sequence. This idea of passing the state forward through time is the basis of what we call a recurrent cell, or recurrent neurons. What that means is that the function and computation of the neuron is a product of both the current input and this memory passed on from previous time steps, reflected in the state variable. On the right side of this slide, what you're seeing is basically that model, that neural network, unrolled or unwrapped across individual time steps; but the most important thing is that it is still just one model that has this relationship with itself. This is the mind-warping part: thinking about how we unroll, visualize, and reason about this operation across individual time steps, versus how the model has a recurrence relation with respect to itself. So this is the central idea, this notion of recurrence in a neural network architecture that we call a recurrent neural network, an RNN, and RNNs really are one of the foundational frameworks for sequence modeling problems. Let's develop a little more detail and a little more of the mathematics behind RNNs, now that we have this intuition about state updates and the recurrence relation. Our next step is to formalize this. The key idea we talked about is that we have a state, h of t, and it is updated at each time step as we process the sequence. That update is captured in what we call the recurrence relation, and this is a standard neural network operation, just as we saw in lecture one.
What we are doing is maintaining the cell state variable h of t and learning a set of weights W; the update, h_t = f_W(x_t, h_{t-1}), is a function of both the input at a particular time step and the information passed on from the previous time step through this state variable. What's really important to note is that for a particular RNN layer, we have the same set of weight parameters, which are updated as the model is trained: the same function and the same set of weights; the only difference is that we are processing the data step by step.
We can also think about this from another angle, in terms of how we would actually implement an RNN. We begin by initializing the hidden state and taking an input sentence, broken down into individual words, that we want the RNN to process. All we are going to do is iterate through each of the individual words, the individual time steps in the sentence, update the hidden state, and generate an output prediction based on the current word and the hidden state. Then, at the end, we can take that updated hidden state and generate the prediction for what word comes next at the end of the sentence. This is the idea of how the RNN combines a state update with an output that we can generate at each time step; a rough sketch of this loop is shown below.
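To make that loop concrete, here is a minimal NumPy sketch of the iteration and the state update it performs. The toy dimensions, random weights, and random stand-in vectors for the embedded words are placeholders for illustration, not the lecture's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim, output_dim = 4, 3, 3

W_xh = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output

h = np.zeros(hidden_dim)                          # initialize the hidden state
# stand-ins for a sequence of embedded words
sequence = [rng.normal(size=input_dim) for _ in range(5)]

for x_t in sequence:
    h = np.tanh(W_hh @ h + W_xh @ x_t)            # state update (the recurrence relation)
    y_t = W_hy @ h                                # per-time-step output prediction

print(y_t)  # prediction after the final time step, e.g. scores for the next word
```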
To write this down, for an input vector x_t we use a mathematical description based on a non-linear activation function and a set of neural network weights to update the hidden state h of t. Although this may look complicated, it is very similar to what we saw before: all we are doing is learning one weight matrix to transform the previous hidden state and one to transform the input, multiplying them by their respective inputs, adding the results, applying a non-linearity, and using this to update the state variable: h_t = tanh(W_hh h_{t-1} + W_xh x_t). Finally, we can generate an actual prediction at that time step as a function of the updated internal state: once the RNN has updated its state, we apply another weight matrix to generate the output prediction, y_t = W_hy h_t. A question from the audience: can you use different non-linear activation functions, and if so, is there intuition about which one to choose?
Yes, absolutely. So the question is how we choose an activation function other than tanh. You can indeed choose different activation functions; we'll get to how we build that intuition a little later in the lecture, and we'll also see examples of slightly more complicated versions of RNNs that use multiple different activation functions within one RNN layer, which is another strategy that can be used. So this is the idea of updating the internal state and generating the output prediction, and we can represent it either using this loop notation or by unrolling the RNN across the individual time steps, which may be a little more intuitive. The idea is that you have an input at a particular time step, and you can visualize how that input and output prediction occur at each individual time step in your sequence. By making the weight matrices explicit, we can see that this ultimately leads both to updates of the hidden state and to predictions of the output, and it further emphasizes that it is the same weight matrix for the input-to-hidden-state transformation and the same hidden-state-to-output transformation being reused across these time steps.
This gives us an idea of how we can move forward through the RNN to compute predictions. To actually learn the weights of this RNN, we have to compute a loss and use backpropagation to adjust the weights based on that loss, and because we now compute things time step by time step, what we can do is simply take the individual loss from each time step, add them all together, and get a total loss across the entire sequence: L = L_1 + L_2 + ... + L_T. Another question from the audience: how do these weight matrices differ from the bias term?
A bias is something separate that gets added on, whereas the weight matrix itself is applied to, say, the input and transforms it. In this visualization and in the equations we showed, we abstracted away the bias term, but the important thing to keep in mind is that the matrix multiplication is a function of the learned weight matrix multiplied by the input or by the hidden state. Okay. Now, for a little more detail on the inner workings, let's look at how we can implement an RNN layer from scratch using code in TensorFlow. The RNN itself is a layer, a neural network layer, and what we start by doing is initializing those three sets of weight matrices that are key to the RNN computation; that is what is done in the first block of code.
Alongside that initialization, we also initialize the hidden state. The next thing we have to do to build an RNN from scratch is define how we actually make a prediction when we pass a call to the model, and what that amounts to is taking the hidden state update equation and translating it into Python code: applying the weight matrices, applying the non-linearity, and computing the output as a transformation of the updated state. Finally, at each time step, both the updated hidden state and the output prediction can be returned by the RNN's call function. This gives you an idea of the inner workings and of the computation translated into code, but in the end TensorFlow and other machine learning frameworks abstract much of this away, so you can simply define the dimensionality of the RNN you want and use built-in functions and layers to define it in code.
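As a rough reconstruction of that from-scratch layer (a hedged sketch of the idea described above, not the official course code; shapes assume a single column-vector input per step):

```python
import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    """A minimal from-scratch RNN cell following the update rule above (illustrative only)."""

    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()
        # the three weight matrices that define the RNN computation
        self.W_xh = self.add_weight(shape=(rnn_units, input_dim), initializer="glorot_uniform")
        self.W_hh = self.add_weight(shape=(rnn_units, rnn_units), initializer="glorot_uniform")
        self.W_hy = self.add_weight(shape=(output_dim, rnn_units), initializer="glorot_uniform")
        # initialize the hidden state to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # update the hidden state: tanh non-linearity over the two weighted contributions
        self.h = tf.math.tanh(self.W_hh @ self.h + self.W_xh @ x)
        # compute the output as a transformation of the updated hidden state
        output = self.W_hy @ self.h
        # return both the per-step prediction and the new hidden state
        return output, self.h

# In practice, frameworks provide this as a built-in layer, e.g.
# tf.keras.layers.SimpleRNN(rnn_units)
```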
Again, the flexibility we get from thinking in terms of sequences lets us consider different kinds of problems and settings where sequence modeling becomes important. We can process the individual time steps across the sequence and generate only one output at the end, such as a classification of the sentiment associated with a particular sentence; we can take a single input and generate outputs at individual time steps; and finally we can go from an input sequence to an output sequence, as in translation. You'll get hands-on practice implementing and developing a neural network for this type of problem in today's lab, the first lab of the course. So far we've talked about how an RNN works and what the underlying framework is, but when we think about sequence modeling problems, we should also ask what unique properties we need a neural network to capture in order to handle this data well. We can all appreciate that sequences are not all the same length: a sentence can have five words or a hundred, and we want our model to be flexible enough to handle both cases.
We also need to maintain a sense of memory so we can track dependencies that occur across the sequence: things that appear very early on may matter later, and we want our model to reflect and capture that. The sequence inherently has an order, and we need to preserve it, and we need to learn a shared set of parameters that are used throughout the sequence. RNNs give us the ability to do all of these things; they are better at some of them than others, and we'll explain a bit why that is. The important thing to keep in mind as we move through the rest of the lecture is what we're really trying to get our neural network to do in practice, in terms of the capabilities it has. So let's now go into more detail on a very typical sequence modeling problem that you'll encounter, which is the following: given a set of words, we want to predict the next word that follows. Let's make this very concrete with an example sentence.
"This morning I took my cat for a walk." Our task could be as follows: given the first words in this sentence, we want to predict the word that comes next, "walk". How can we do this? Before we think about building our RNN, the first thing we need is a way of representing this text, this language, to the neural network. Remember, neural networks are just numerical operators; their underlying computation is just math implemented in code, and they don't have any notion of what a word is. They cannot interpret words; what they can interpret and operate on are numerical inputs. So there is a big question in this field of sequence modeling and natural language: how do we encode language in a way that is understandable and makes sense for a neural network to work with numerically?
This is what we call an embedding, and what it means is that we want to transform an input in some other modality, like language, into a numerical vector of a particular size that we can then feed into our neural network model and operate on. With language, there are different ways we can think about building this embedding. A very simple way is to say we have a large vocabulary, a set of words: all the distinct words in English, for example. We can take those distinct words and simply assign each one a number, an index, so that every different word in the vocabulary has a different index. Then we construct vectors whose length is the number of words in our vocabulary, and they simply indicate with a binary one or zero whether the vector represents that word or some other word. This is the idea of what we call a one-hot embedding, or one-hot encoding, and it is a very simple but very powerful way to represent language in numerical form so that we can operate on it with a neural network; a small illustration follows.
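As a concrete toy illustration of the one-hot idea (the vocabulary here is made up for the example):

```python
import numpy as np

# a toy vocabulary; each distinct word gets a unique index
vocab = ["a", "cat", "for", "i", "morning", "my", "the", "this", "took", "walk"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # a vector the length of the vocabulary, with a 1 at the word's index and 0 elsewhere
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
```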
Another option is to do something a little more sophisticated and try to learn a numerical encoding, a vector that maps words or other components of our language into some kind of space, where the idea is that things that are related in the language should be numerically similar, close together in this space, and things that are very different should be numerically different and far apart. This is also a very powerful concept: learning an embedding and then feeding those learned vectors into a downstream neural network. So this addresses a big problem of how we actually encode language.
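A hedged sketch of such a learned embedding layer in TensorFlow (the vocabulary size and embedding dimension here are arbitrary illustrative choices, not values from the lecture):

```python
import tensorflow as tf

vocab_size = 10_000   # number of distinct words in our (hypothetical) vocabulary
embedding_dim = 64    # size of the learned, dense embedding space

embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

# map a batch of word indices to dense vectors; related words should end up
# close together in this space once the embedding is trained end-to-end
word_indices = tf.constant([[4, 532, 7, 91]])
vectors = embedding(word_indices)
print(vectors.shape)  # (1, 4, 64)
```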
The next thing in how we approach this sequence modeling problem is that we need a way to handle sequences of different lengths: a four-word sentence, a six-word sentence; the network needs to handle both. The difficulty that comes with handling variable sequence lengths is that, as your sequences get longer and longer, your network needs the ability to capture information from the beginning of the sequence, process it, and incorporate it into an output generated much later in the sequence. This is the idea of a long-term dependency, the idea of the network's memory, and it is another very fundamental problem in sequence modeling that you will encounter in practice.
The other aspect we'll touch on briefly is, again, the intuition behind order: the whole point of a sequence is that the order in which things appear captures something meaningful. Even if we have the same set of words, if we reverse the order, the network's representation and modeling of it should be different and should capture that order dependence. All of this is to say that this natural language example, the task of predicting the next word, highlights why this is a challenging problem for a neural network to learn and model, and what we should keep in the back of our minds as we implement, test, and build these algorithms and models in practice. A quick question from the audience: for the embedding, how do you know what dimension of space you're supposed to use to group things?
This is a fantastic question: how big do we set that embedding space? You might imagine that as the number of different things in your vocabulary increases, a bigger space would be useful, but that's not always true. Strictly increasing the dimensionality of the embedding space does not necessarily lead to a better embedding, and the reason is that the space gets sparser the bigger it is; effectively you end up building a lookup table that is roughly a one-hot encoding, which defeats the purpose of learning the embedding in the first place. The idea is to strike a balance: an embedding space large enough to have the capacity to capture all the diversity and richness of the data, but small enough to be efficient, so the embedding actually gives you an efficient bottleneck and representation. It's a design choice, informed by what empirically works as an effective embedding size for, say, language, but that's the balance to keep in mind.
I'm going to continue for the sake of time, and we'll have time for questions at the end. That gives us a sense of how RNNs work and where we run into these sequence modeling problems. Now let's dive a little deeper into how we actually train RNNs, using the same backpropagation algorithm Alexander introduced. Remember that in a standard feedforward network, the operation is as follows: we take our inputs and compute on them in the forward pass to generate an output, and when we go backwards, when we try to update the weights based on the loss, we backpropagate the gradients through the network toward the input to tune the parameters and minimize the loss. The whole concept is that we have our loss objective and we are trying to change the parameters of the model, the weights, to minimize that objective. With RNNs there is now a wrinkle, because the loss is computed step by step as we do this sequential computation and then summed at the end to get a total loss. That means that when we do our backward pass, we have to backpropagate the gradients per time step and then, finally, across all the time steps, from the end all the way back to the beginning of the sequence. This is the idea of backpropagation through time, because errors propagate backward along the time axis, back to the beginning of the data sequence.
Now maybe you can see why this can get a bit complicated. If we look more closely at how this computation actually works, backpropagation through time means that as we go step by step, we have to do repeated multiplications involving the weight matrix, again and again. The reason this can be very problematic is that if those values are very large and you multiply or take derivatives with respect to them repeatedly, you can get gradients that grow excessively, uncontrollably, and explode in such a way that learning is no longer manageable. One thing that is done in practice is gradient clipping: effectively scaling the gradients back down so that learning stays stable; a sketch of this is below. You can also have the opposite problem: if your values are very small and you have these repeated matrix multiplications, the values can shrink very quickly, becoming smaller and smaller, and this is also quite bad. There are strategies we can employ in practice to try to mitigate this vanishing gradient problem as well.
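One common mitigation mentioned here, clipping exploding gradients, looks roughly like this in TensorFlow (a sketch, not the lab's exact training code; `tape` and `model` in the commented variant are hypothetical placeholders):

```python
import tensorflow as tf

# Option 1: let the optimizer clip the global gradient norm for you.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Option 2: clip manually inside a custom training step.
# grads = tape.gradient(loss, model.trainable_variables)       # hypothetical tape/model
# grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)      # rescale if the norm is too large
# optimizer.apply_gradients(zip(grads, model.trainable_variables))
```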
The reason vanishing gradients are such a real problem for learning an effective model is that they undermine our ability to model long-term dependencies. As the length of the sequence increases, you need more memory capacity to track long-term dependencies well; but if your sequence is very long and has long-term dependencies while your gradients keep vanishing, you lose the capacity to learn anything useful and keep track of those dependencies within the model, and the ability of the network to model those dependencies is reduced or destroyed. So we need real strategies to mitigate this within the RNN framework, given its inherently sequential processing of data. In practice, going back to one of the earlier questions about how we select activation functions, something very common in RNNs is to choose the activation functions wisely in order to mitigate the vanishing gradient problem a little.
For example, we can use an activation function whose derivative is either zero or one, such as the ReLU activation. Another strategy is to initialize the weights intelligently, setting the initial values of the weight matrices to a good starting point, so that once we start doing updates we're less likely to run into the vanishing gradient problem as we do those repeated matrix multiplications. The final and most robust idea in practice is to build a more sophisticated recurrent cell itself, and this is the concept of what we call gating, which introduces additional computations within the recurrent cell that can selectively maintain, delete, or forget aspects of the information entering the recurrent unit. We are not going to go into detail on how gating works mathematically, for reasons of time and focus, but the important thing I want to convey is that there is a very common architecture called the LSTM, or long short-term memory network, that employs this notion of gating to be more robust than a standard RNN at tracking long-term dependencies. The core idea to take away from this gating concept is, again, about how information is updated numerically within the recurrent unit. What LSTMs do is very similar to the RNN in that they maintain a cell state; the difference is how that cell state is updated. The LSTM uses additional layers of computation to selectively forget certain information and selectively keep other information, and this is the intuition behind how the different components within an LSTM interact with each other to provide a smarter update of the cell state, one that better preserves the core information that is needed.
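In frameworks like TensorFlow, swapping a vanilla recurrent layer for a gated one is essentially a drop-in change; a hedged sketch (layer sizes are illustrative):

```python
import tensorflow as tf

rnn_units = 128

# a standard (vanilla) recurrent layer:
simple_layer = tf.keras.layers.SimpleRNN(rnn_units, return_sequences=True)

# the gated LSTM alternative, which maintains a cell state and learns gates
# that selectively keep or forget information over long sequences:
lstm_layer = tf.keras.layers.LSTM(rnn_units, return_sequences=True)
```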
The other thing I'll point out is that this forget-or-keep operation, which I'm describing in a very high-level, abstract way, is entirely learned through actual weight matrices defined as part of these neural network units; the gating language is just our way of abstracting and reasoning about the mathematical operations at the core of a model like this. Okay, to close our discussion of RNNs, we'll very briefly touch on some of the applications where we've seen them used. One common use is music generation, and this is what you'll actually get hands-on practice with in the software labs, building a recurrent neural network and using it to generate new songs. The example I'll play is a demo from a few years ago of a musical piece generated by an architecture based on a recurrent neural network that was trained on classical music and then asked to produce the remainder of a piece left unfinished by the composer Franz Schubert, who died before he could complete his famous Unfinished Symphony. This was the result when the neural network was asked to compose two new movements based on the true earlier movements. We'll let it play for a bit, but you can appreciate the quality of it. I'd also like to briefly highlight that on Thursday we'll have an incredible guest lecture that takes this idea of music generation to a whole new level, so stay tuned; that's just a teaser and a preview for later.
We also introduced the sequence classification problem, something like assigning a sentiment to an input sentence. Again, we can think of this as a classification problem where we reason about and operate on sequence data, but ultimately we are trying to produce a probability associated with that sequence: whether a sentence is positive or negative, for example. So this gives you two kinds of examples, sequence generation like music and also classification, that we can approach with recurrent models; a minimal end-to-end sketch of a sentiment classifier follows.
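Putting a few of these pieces together, a many-to-one sentiment classifier might be sketched like this (the layer sizes and choice of LSTM are illustrative assumptions, not the lecture's exact model):

```python
import tensorflow as tf

vocab_size = 10_000
embedding_dim = 64
rnn_units = 128

# sequence in, single probability out: embed the word indices, summarize the
# sequence with a recurrent layer, then classify positive vs. negative
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),    # variable-length sequences of word indices
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(rnn_units),                 # final hidden state summarizes the sentence
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability the sentiment is positive
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```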
Now, we've talked about these design criteria, what we really want any neural network model to do when it handles sequential data, but that doesn't mean the answer has to be an RNN. In fact, RNNs have some fundamental limitations due to the very fact that they operate time step by time step. The first is that encoding really long sequences is limited by memory capacity: information in very long sequences can be lost because we impose a bottleneck through the size of the hidden state that the RNN is trying to learn. In addition, because we have to look at each element of the sequence one by one, it can be really slow and computationally intensive as things get longer, and as we discussed with long-term dependencies and vanishing gradients, the memory capacity of a standard RNN is simply not enough to track long sequences effectively. So let's look at these issues a little more. Thinking about our high-level goal of sequence modeling, we want to take our input, learn some features with the neural network, and use those features to generate a series of outputs. We said, okay, let's do this
We want our model to be able to handle this. We want the calculation to be efficient. We want to be able to have this long memory capacity to handle those dependencies and rich information, so what if we eliminated this need to process information sequentially step by step and got away with recurrence altogether? How could we learn a neural network in this environment? A naive approach that We could say okay, you know we have sequence data, but what would happen if we all put it together, squashed it, and concatenated it into a single vector? We feed it into the model, calculate some features, and then use them to generate results.
This may seem like a reasonable first try, but while we've eliminated the recurrence, we've also completely eliminated the notion of sequence in the data; we've restricted scalability, because we're cramming everything into a single input; we've thrown away the order; and as a result we've lost the memory capacity. The central idea that emerged about five years ago, when people thought about how to build a more effective architecture for sequence modeling problems, was this: instead of processing things step by step in isolation, let's take the sequence as it is and learn a neural network model that can tell us which parts of that sequence are the really important parts, which parts convey the important information the network should capture and learn. This is the core idea of attention, which is a very, very powerful mechanism for modern neural networks used in sequential processing tasks, and a prelude to what is to come in the rest of this lecture and in a couple of lectures later in the course.
I'm sure everyone in this room — raise your hand if you've heard of GPT, or talked to ChatGPT, or heard of BERT. Hopefully everyone knows what the Transformer is: a Transformer is a type of neural network architecture that relies on attention as its fundamental mechanism. In the rest of this lecture, you'll get a sense of how attention works, what it does, and why it's such a powerful building block for these large architectures like Transformers. I think attention is a beautiful concept, really elegant and intuitive, and hopefully we can convey that in what follows. The core nugget of intuition is this idea: let's attend to and extract the most important parts of an input. What we'll focus on specifically is what we call self-attention, attending to parts of the input itself. Let's look at this image of the hero Iron Man: how can we figure out what is important in this image?
A super naive way would be to scan pixel by pixel, look at each one, and then say, okay, this is important, this is not. But our brains are immediately able to look at the image and pick out, yes, Iron Man is important; we can focus our attention there. That is the intuition: identify which parts of an input to attend to, and pull out the associated features that have a high attention score. This is very similar to how we think about search: searching a database, or searching an input, to extract the important parts. So let's say you have a search problem: you came to this class
with the question: how can I learn more about neural networks, deep learning, AI? One thing you can do, besides taking this class, is to go online, go to YouTube, and try to find something that will help with this search. Now we are searching a giant database: how can we find and attend to the relevant video that helps with our search problem? Well, we start by supplying a query, "deep learning," and that query needs to be compared against what we have in our database, the titles of the different videos that exist. Let's call these the keys. Our operation is to take the query and check how closely it matches each key.
Is the first video, beautiful elegant sea turtles on a coral reef, similar to my query? Not similar. Is this second key, the lecture "2020 Introduction to Deep Learning," similar? Yes. Is this last key similar? No. So we are computing an effective attention mask from this metric of how similar our query is to each of these keys. Now we want to extract the relevant information from that match, and this is returning the value, the content that has the highest attention. This is a metaphor, an analogy to the search problem, but it conveys the three key components of the attention mechanism and how it works mathematically: query, key, and value. So let's unpack that and return to our sequence modeling problem of looking at a natural language sentence. Our goal with a neural network that uses self-attention is to look at this input and identify and attend to the features that are most important.
The first thing to note is that we are not going to handle this sequence step by step, but we still need a way to process and preserve information about position and order. What is done in self-attention and in Transformers is an operation we call a position-aware encoding, or positional encoding. We're not going to go into the mathematical details, but the idea is that we can compute an embedding that preserves information about the relative positions of the components of the sequence, and there are clean, elegant mathematical solutions that let us do this very effectively. All you need to know for the purposes of this class is that we take our input and do a computation that gives us a position-aware embedding.
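The lecture doesn't derive the encoding itself; one common concrete choice, shown here purely as an illustration, is the fixed sinusoidal encoding from the original Transformer paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos positional encodings; returns an array of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                   # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions: cosine
    return encoding

# added to the token embeddings so the model retains relative-position information:
# inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```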
Now we take that positional embedding and compute a form of the query, the key, and the value for our search operation. Remember, our task here is to extract what in the input is relevant to our search. How do we do it? The answer is the message of this class in general: we learn it, with neural network layers. In attention and in Transformers, that positional embedding is fed through three separate linear neural network layers that are used to compute the query, key, and value matrices, the three main components of the search operation we introduced with the YouTube analogy: query, key, and value.
Now we do exactly the same task of computing similarity in order to compute the attention score. What we do is take the query matrix and the key matrix and define a way to compute how similar they are. Remember, these are numerical matrices, and intuitively you can think of their rows as vectors in a high-dimensional space. When two vectors live in that same space, we can mathematically measure how close they are using a metric: computing the dot product of those vectors, which is closely related to their cosine similarity, reflects how similar the query and key matrices are in this space, and gives us a way, once we apply some scaling, to compute a metric for the attention weighting.
Now, thinking about what this operation actually does: remember that the whole point of the query and key computation is to find the features and components of the input that are important for this self-attention. If we visualize, say, the words in a sentence, we can compute this self-attention score, this attention weighting, which lets us interpret the relative relationships of the words in that sentence, how they relate to each other, and all of that comes from the fact that we are learning this operation directly on the input and attending to parts of it. We can then squash that similarity so it lies between zero and one using an operation known as the softmax, and this gives us concrete weights, the attention scores. The final step in the whole self-attention process
is to take that attention weighting, take our value matrix, multiply them, and actually extract features from this operation. So it's a really elegant idea: take the input itself, and use these three interacting components, the query, the key, and the value, not just to identify what is important but to actually extract relevant features based on those attention scores. Let's put it all together, step by step. The overall goal is to identify and attend to the most important features. We take our positional encoding, which captures some notion of order and position, and we extract the query, key, and value matrices using learned linear layers.
We then compute the similarity metric via the dot product, scale it, and apply a softmax to put it between zero and one, constituting our attention weights. Finally, we multiply those weights by the value matrix and use this to actually extract features, relative to the input itself, that have high attention scores. All of this together forms what we call a single self-attention head (a minimal sketch follows below), and the beauty is that you can now place multiple attention heads together to design a larger neural network like a Transformer. The idea is that this attention mechanism is really the fundamental building block of the Transformer architecture, and part of why the architecture is so powerful is that we can parallelize and stack these attention heads so that they attend to different features, different components of the input that are important. We could have, say, an attention mask that attends to Iron Man in the image as the output of the first attention head, while other attention heads pick up other relevant features, other components of this complex space. So hopefully you now have an understanding of the inner workings of this mechanism and an intuition for the elegance of the attention operation. We're now seeing attention as the basis of the Transformer architecture applied in many different domains and settings, perhaps most prominently natural language, with models like GPT, which is the basis of a tool like ChatGPT. You'll get hands-on experience building and fine-tuning large language models in the final lab of the course, and we'll dive deeper into the details of these architectures later on. But it doesn't stop there.
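A hedged NumPy sketch of a single self-attention head as just described (toy dimensions and random weights for illustration; real Transformer implementations add batching, masking, and multiple heads):

```python
import numpy as np

def softmax(scores):
    # softmax along the last axis, numerically stabilized
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_head(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) position-aware embeddings for one sequence."""
    Q = x @ W_q                            # queries
    K = x @ W_k                            # keys
    V = x @ W_v                            # values
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)      # scaled dot-product similarity
    weights = softmax(scores)              # attention weights in [0, 1]; rows sum to 1
    return weights @ V, weights            # attended features + the attention map

# toy example
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
features, attn = self_attention_head(x, W_q, W_k, W_v)
print(features.shape, attn.shape)  # (5, 8) (5, 5)
```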
The question of what language is, what a sequence is, and the idea of attention and the Transformer extend far beyond human text and written language. We can model sequences in biology, like DNA or protein sequences, using these same principles and the same architectural building blocks, reasoning about biology in very complex ways to do things like accurately predicting the three-dimensional structure of a protein based solely on its sequence information. Finally, Transformers and the notion of attention have been applied to things that are not intuitively sequence data or language at all, even tasks like computer vision, with architectures known as Vision Transformers that again employ this same notion of self-attention. So, to close and summarize: this was a pretty dizzying sprint through what sequence modeling is; how RNNs are a good first starting point for sequence modeling tasks, using this notion of time-step processing and recurrence; how we can train them using backpropagation through time; how we can apply RNNs and other sequence models to a variety of tasks, music generation and beyond; how we can move beyond recurrence to learn self-attention mechanisms that model sequences without going step by step; and finally, how self-attention can form the basis of very powerful and very large architectures, such as large language models. That concludes today's portion of the lecture. Thank you very much for your attention and for bearing with us through this sprint and this week-long boot camp. With that, I'll close and say that we now have open time to talk with each other, with the TAs, and with the instructors about your questions, and to get started with the labs and start implementing. Thank you very much.
