
MIT Introduction to Deep Learning | 6.S191

May 07, 2024
Good afternoon everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for this year's course, along with Ava. Together we're very excited to welcome you to this truly amazing course: a fast-paced and very intense one week that we're about to spend together, where we'll cover the foundations of a field that itself moves at a very fast pace and has been changing rapidly over the eight years that we've been teaching this course at MIT. In fact, even before we started teaching this course, AI and deep learning had been revolutionizing so many different advances and so many different areas of science, math, physics, and so on. Not that long ago, there were kinds of challenges and problems that we did not think were necessarily solvable in our lifetimes, and AI is now solving them beyond human performance. Every year that we teach this course, this particular lecture, lecture number one, gets harder and harder to teach, because for an introductory-level course it is the lecture that is supposed to cover the foundations. If you think of any other introductory course, say an intro 101 course on math or biology, those lectures don't really change much over time; but we are in a rapidly changing field of AI and deep learning where even these types of lectures are changing quickly. So let me give you an example of how we introduced this course just a few years ago.
"Hello everyone, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing so many fields, from robotics to medicine and everything in between. You will learn the fundamentals of this field and how you can build these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence, and in this class you will learn how. It has been an honor to speak with you today and I hope you enjoy the course." The really amazing thing about that video when we first made it was how viral it went a few years ago. Within just a couple of months of us teaching this course, that video went extremely viral — it got over a million views within just a few months.
People were amazed by a few things, but the main one was the realism of AI in being able to generate content that looks and sounds extremely hyper-realistic. When we created this video for the class just a few years ago, it cost us about $10,000 of compute to generate a clip only about a minute long. If you think about it, that is extremely expensive compute for something that, seen today, maybe doesn't even impress a lot of you, because you now see all of the amazing things that AI and deep learning are producing. Progress in deep learning is moving incredibly fast, and people were making all kinds of interesting comments about that video when it came out a few years ago.
Today, AI is doing far more powerful things than this fun little introductory video. So let's fast-forward about four years to today: where are we now? AI is now generating content at scale, and deep learning has become highly commoditized — it's at our fingertips, online, on our smartphones, and so on. In fact, we can use deep learning to generate these kinds of hyper-realistic media and content entirely from English language, without even coding. Before, we would have had to train these models and actually code them ourselves to create that minute-long video.
Today we have models that will do that for us, end to end, directly from English language. We can ask these models to create something the world has never seen before — say, a photo of an astronaut riding a horse — and they can imagine that content completely from scratch. My personal favorite is how we can now ask these deep learning models to create new types of software, even though they are themselves software: to ask them, for example, to write this piece of TensorFlow code to train a neural network. We are asking one neural network to write TensorFlow code to train another neural network, and our model can produce examples of functional, usable code that satisfies this English prompt, while also walking through each part of the code independently — not just producing it, but actually educating the user about what each of these blocks of code does. You can see an example here.

Really, what I'm trying to show you with all of this is just how far deep learning has come, even in the couple of years since we started teaching this course, let alone the eight years since we began. The most amazing thing, in my opinion, is that in this course we're going to teach you the foundations of all of it: how you can build all of these different types of models from scratch, and how all of these incredible advances are made possible — so that you can also go and do this on your own. As I mentioned at the beginning, this introductory course becomes harder and harder to make and keep current every year.

I honestly don't know where the field will be next year, or even a month or two from now, simply because it's moving so incredibly fast. What I do know is that what we share with you this week will be the foundations of all of the technologies we've seen up to this point, foundations that will allow you to create that future for yourself and to design brand-new types of deep learning models using those building blocks. So let's get started, and start figuring out how we can actually accomplish all of these different pieces and learn all of these different components. We should begin by tackling the foundations from the very beginning and asking ourselves: we've all heard the term deep learning before coming to this class today, but it's important to really understand how this concept of deep learning relates to all of the other pieces of science that you've learned so far.

To do that, we have to start from the very beginning and think about what intelligence is, at its core — not even artificial intelligence, just intelligence. The way I like to think about it is that intelligence is the ability to process information so that it informs your future decision-making abilities; that's something we as humans do every single day. Artificial intelligence is simply the ability for us to give computers that same capability: to process information and inform future decisions. Machine learning is then simply a subset of artificial intelligence.

The way you should think of machine learning is as the science of trying to teach computers how to do that processing of information and decision-making from data. Instead of hard-coding rules into machines and programming them the way we used to in software engineering classes, we try to learn that information processing and future decision-making directly from data. Going one step deeper, deep learning is simply the subset of machine learning that uses neural networks to process raw data — to ingest very large datasets of raw data and use them to inform future decisions. And that's exactly what this class is about: if I had to summarize it in one line, it's about teaching machines how to process data, process information, and inform decision-making abilities from that data — and to learn that directly from the data.
Now, this program is split into two parts, so you should think of this class as being captured by both the technical lectures — which this, for example, is part of — as well as the software labs. We'll have several new updates this year, as I mentioned earlier, covering the rapid pace of advances in AI, especially in some of the later lectures. Today's first lecture covers the foundations of neural networks themselves, starting with the building block of every neural network, which is called the perceptron. We'll then work through the week and conclude with a series of exciting guest lectures from leading industry sponsors of the course. On the software side, after every lecture you'll also get software and project-building experience, so you can take what we teach in lecture, implement it in actual code, and build on the learnings from these lectures. At the very end of the class, on the software side, you'll get to participate in a really fun day, the project pitch competition — it's like a Shark Tank-style competition of all of your different projects, with some really awesome prizes to win.

Let's go through the lecture portion of the syllabus briefly: each day we'll have dedicated software labs that basically mirror the technical lectures, just to help you reinforce your learning, and these are coupled each day with prizes for the top-performing software solutions submitted to the class. That starts today with lab one, on music generation: you'll learn how to build a neural network that can learn from a bunch of songs, listen to them, and then compose brand-new songs in that same genre. Tomorrow, lab two covers computer vision: you'll learn about facial detection systems and build one from scratch using convolutional neural networks — tomorrow you'll learn what that means — and you'll also learn how to actually remove the biases that exist in some of these facial detection systems, which is a huge problem for the state-of-the-art solutions that exist today. Finally, a brand-new lab at the end of the course will focus on large language models, where you'll take a billion-parameter large language model and fine-tune it to build an assistive chatbot, and evaluate a set of cognitive abilities ranging from mathematics to scientific reasoning to logical abilities. At the very end there will be a final project pitch competition, up to five minutes per team, all accompanied by great prizes, so there will definitely be a lot of fun throughout the week.

There are many resources to help with this class; you'll see them posted here, and you don't need to write them down because all of the slides are already posted online. Please post on Piazza if you have questions. We have an amazing team helping to deliver this course this year, and you're welcome to reach out to any of us if you have questions — Piazza is a great place to start. Ava and I will be giving the main lectures for this course, especially Monday through Wednesday, and in the second half of the course we'll also hear some amazing guest lectures that you'll definitely want to attend, because they cover the really advanced side of deep learning that is happening in industry, outside of academia. And very briefly, I just want to give a huge thanks to all of our sponsors, without whose support this course, like every year, simply would not be possible.
So now let's start with the fun stuff, my favorite part of the course, which is the technical part, and let's begin by asking ourselves a question: why do we care about any of this? Why do we care about deep learning? Why did you all come here today to learn about it and listen to this course? To understand that, I think we need to go back a bit and understand how machine learning used to be done. Machine learning would typically define a set of features — you can think of these as a set of things to look for in an image or in data — and they were usually hand-engineered, so humans would have to define them themselves. The problem is that hand-engineered features tend to be very brittle in practice, simply because a human defined them. The key idea of deep learning, and what you're going to learn throughout this week, is this paradigm shift away from hand-engineering features and rules: instead, we have the computer try to learn them directly from raw data. What are the patterns in the dataset that we should look at, such that if we look at those patterns we can make some interesting decisions, and interesting actions can come out of it? For example, if we wanted to learn how to detect faces, think about how you would detect a face: if you look at a picture, what do you look for? You look for particular patterns — eyes, noses, ears — and when those things are composed in a certain way, you would probably deduce that it's a face.

Computers do something very similar: they have to understand what patterns to look for — what eyes, noses, and ears look like in the data — and then detect and predict from there. The really interesting thing about deep learning, I think, is that these foundations for doing exactly what I just described — picking out the building blocks, picking out the features from raw data — and the underlying algorithms themselves have actually existed for many decades. The question I would ask at this point is: why are we studying this now, and why is all of this really exploding right now, with so many great advances?

Well, for a start, there are three things. Number one, the data available to us today is significantly more pervasive. These models are hungry for data — you'll learn about this in more detail later — and we're living in a world where, frankly, data is more abundant than it has ever been in our history. Second, these algorithms are compute-hungry and massively parallelizable, which means they have benefited tremendously from compute hardware that is itself capable of being parallelized. The particular name of that hardware is the GPU: GPUs can run parallel streams of information processing and are particularly amenable to deep learning algorithms, and the abundance of GPUs and that compute hardware has also pushed forward what we can do in deep learning. And finally, the last piece is the software.

It's the open-source tools that really serve as the building blocks for deploying and building all of the underlying models that you're going to learn about in this course, and those open-source tools have become extremely streamlined, making it extremely easy for all of us to learn about these technologies through an amazing course like this one. So now that we have some of the background, let's start by understanding exactly what the fundamental building block of a neural network is. That building block is called the perceptron. Every single neural network is made up of multiple perceptrons, and you're going to learn, number one, how those perceptrons compute information themselves, and number two, how they connect together into these much larger, billion-parameter neural networks. The idea of a perceptron — or, even simpler, think of it as a single neuron: a neural network is made up of many neurons, and a perceptron is just a single neuron — is actually extremely simple, and I hope that by the end of today this idea and the processing a perceptron does will be extremely clear to you. So let's start by talking about the forward propagation of information through a single neuron.
A single neuron takes in information — it can actually take in multiple pieces of information — so here you can see this neuron taking as input three pieces of information: x1, x2, and xm. We define the set of inputs x1 through xm, and each of these inputs, each of these numbers, is multiplied elementwise by a corresponding weight, denoted here w1 through wm — one weight for each input. You should think of each weight as being assigned to its input; the weights are part of the neuron itself. You multiply all of these inputs by their weights and then add them up; we take that single number after the addition and pass it through what's called a nonlinear activation function to produce the final output, which here we'll call y. Now, what I just said is not entirely correct: I left out one critical piece of information, which is that we also have what you see here called the bias term. The bias term is what allows your neuron to shift its activation function horizontally along the x-axis, if you think about it. On the right side you can now see a diagram that illustrates mathematically the single equation I just described conceptually.

You can see it written mathematically as a single equation, and we can rewrite it using linear algebra, using vectors and dot products. So let's do that: our inputs are now described by a capital X, which is just a vector of all of our inputs, and our weights by a vector W, w1 through wm. The output is obtained by taking the dot product of X and W, adding the bias term — here we'll call the bias term w0 — and then applying the nonlinearity, denoted here as g.
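To make this concrete, here is a minimal sketch of a single perceptron's forward pass in TensorFlow (the course's framework); the input values and weights below are hypothetical, chosen only to illustrate the three steps of dot product, bias, and nonlinearity:

```python
import tensorflow as tf

def perceptron_forward(x, w, w0):
    """Single-neuron forward pass: dot product with weights, add bias, apply sigmoid nonlinearity."""
    z = tf.reduce_sum(w * x) + w0   # weighted sum of inputs plus bias
    return tf.math.sigmoid(z)       # nonlinear activation g(z)

x = tf.constant([1.0, 2.0, 3.0])    # hypothetical inputs x1..xm
w = tf.constant([0.5, -1.0, 0.25])  # corresponding weights w1..wm
w0 = tf.constant(1.0)               # bias term
y = perceptron_forward(x, w, w0)    # final output y
```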
Now, I've mentioned this nonlinearity, this activation function, several times; let's dig into it a little deeper so we can understand what it's really doing. I said a couple of things about it — I said it's a nonlinear function. Here you can see an example of a common activation function: a commonly used one is called the sigmoid function, which you can see on the bottom right of the screen. The sigmoid function is very commonly used because of its output: it takes as input any real number — the x-axis goes from minus infinity to plus infinity — but on the y-axis it basically squashes every input x into a number between 0 and 1. That makes it a very common choice for things like probability distributions, if you want to convert your outputs into probabilities or teach a neuron to learn a probability distribution. In fact, though, there are many different types of nonlinear activation functions used in neural networks, and here are some common ones. Throughout this presentation, and really throughout the entire course, you'll see these little TensorFlow icons.

The TensorFlow icons at the bottom basically let you relate some of the foundational knowledge we teach in the lectures to the software labs, and they can provide a good starting point for a lot of the pieces you'll need later in the software parts of the class. The sigmoid activation we talked about on the last slide is shown here on the left; it's very popular for probability distributions because it squashes everything between 0 and 1. But you also see two other very common activation functions in the middle and on the right. The other very, very common one — probably the most popular activation function today — is on the far right.

It's called the ReLU activation function, or rectified linear unit. It's basically linear everywhere except for a nonlinearity at x equals 0, where there's a kink — a discontinuity in the slope. Its benefit is that it's very easy to compute: it still has the nonlinearity that we need (we'll talk about why we need it in a second), but it's very fast — just two linear functions combined piecewise.
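For reference, here is a short sketch of these activation functions written directly with TensorFlow ops (the input values are arbitrary; tanh is included as another common nonlinearity):

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 2.0])   # arbitrary pre-activation values

sigmoid_out = tf.math.sigmoid(z)    # squashes every input into (0, 1)
tanh_out = tf.math.tanh(z)          # another common choice, squashes inputs into (-1, 1)
relu_out = tf.nn.relu(z)            # 0 for z < 0, identity for z >= 0
```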
Okay, now let's talk about why we need a nonlinearity in the first place. Why not just use a linear function that we pass all of these inputs through? The whole point of the activation function is to introduce nonlinearities into the network. What we want is to allow our neural network to deal with nonlinear data, because the world is extremely nonlinear. This matters because real-world datasets look like this: if you look at a dataset like this one, the green and red points, and I ask you to build a neural network that can separate the green points from the red points, we actually need a nonlinear function to do that — we cannot solve this problem with a single line. In fact, if we use linear functions as our activation functions, then no matter how big your neural network is, it is still a linear function, because linear functions composed with linear functions are still linear. So no matter how deep your network is or how many parameters it has, the best it could do to separate these green and red points would look like this. Adding nonlinearities allows our neural networks to be smaller, by making them more expressive and able to capture more complexity in the data, and that ultimately makes them much more powerful. So let's understand this with a simple example. Imagine I give you a trained neural network — what does a trained neural network mean?

It means that I'm now giving you the weights, not just the inputs; I'll tell you what the weights of this neural network are. So here let's say the bias term w0 is 1 and our weight vector W is [3, -2]. Those are just the weights of your trained neural network — let's not worry about how we got those weights for now — and this network has two inputs, x1 and x2. We multiply the inputs by the weights, add the bias, and apply the nonlinearity; those are the three components you really need to remember from this class: take the dot product, add the bias, apply a nonlinearity. That process repeats over and over again for every single neuron, and after it happens, the neuron outputs a single number. Now, let's look at what's inside that nonlinearity: it's simply a weighted combination of the inputs with the weights. What's inside g is just a weighted combination of x1 and x2, and it defines a line in this two-dimensional space, because we have two parameters in this model, so we can plot that line. We can see exactly how this neuron separates points in the space of x1 and x2 — the two inputs to the model — and we can see and interpret exactly what this neuron is doing; we can visualize its entire space because we can draw the line that defines this neuron. Here we're plotting the line where that weighted combination equals zero. In fact, if I feed this neuron a new data point — here the new data point is x1 = -1 and x2 = 2 — and ask what the answer will be, and what the sign of the answer will be, we can just follow the equation written above: plugging in -1 and 2 gives 1 - 3 - 4, which equals -6, and passing that through the nonlinearity g gives a final output of roughly 0.002. Don't worry too much about the exact value of the output, which is just the output of the sigmoid function; the important point to remember is that the sigmoid function effectively divides the space into two parts. It squashes everything between 0 and 1, but it implicitly splits its outputs into everything less than 0.5 and everything greater than 0.5, depending on whether its input z is less than 0 or greater than 0 — that is, which side of the line the point falls on. Remember, the line is where z equals 0, where the input to the sigmoid is 0: if a point falls on the negative side of the line, the output will be less than 0.5, and if it falls on the other side, the output will be greater than 0.5.
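Here is that worked example as a quick sketch, using the weights from the slide (w0 = 1, W = [3, -2]) and the new data point x = (-1, 2):

```python
import tensorflow as tf

w0 = tf.constant(1.0)           # bias term from the example
W = tf.constant([3.0, -2.0])    # trained weights from the example
x = tf.constant([-1.0, 2.0])    # new data point: x1 = -1, x2 = 2

z = w0 + tf.reduce_sum(W * x)   # 1 + (3)(-1) + (-2)(2) = -6, the negative side of the line
y = tf.math.sigmoid(z)          # sigmoid(-6) ≈ 0.002, i.e. well below 0.5
```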
Okay, so we can visualize this space in its entirety — this is called the feature space of the neural network — and for this neuron we can fully visualize and interpret it. We can understand exactly what it will do for any input it sees. But of course, this is a very simple neuron: it's not a neural network, it's just one neuron, and even then it's a very simple one, with only two inputs. In reality, the types of neurons and neural networks you'll be dealing with in this course have millions or even billions of these parameters and inputs. Here we have only two weights, w1 and w2, but today's neural networks have billions of these parameters, so drawing the kinds of plots you see here obviously becomes much more challenging;

it isn't actually possible. But now that we have some of the intuition behind a perceptron, let's start building neural networks and see how this all comes together. Let's revisit the previous diagram of a perceptron. If there is only one thing to take away from this lecture right now, it's to remember how a perceptron works — that equation is extremely important for every class that comes after today — and there are only three steps: dot product with the inputs, add a bias, and apply the nonlinearity. Let's simplify the diagram a little.

I'll remove the weight labels from this picture; you can now assume that if I show a line, there is a weight associated with that line. I'll also remove the bias term for simplicity and assume that every neuron has that bias term without showing it. Now note that the result here, called z, is just the dot product plus the bias, before the nonlinearity. On its own, z is linear — just a weighted sum of all of those pieces; we haven't applied the nonlinearity yet — but our final output is g(z),

the activation function, the nonlinear activation function, applied to z. Now let's step this up a little bit and ask: what if we wanted a multi-output function? Instead of just one output, let's say we want two outputs. We can simply have two neurons in this network. Each neuron sees all of the inputs that came before it, but now the top neuron will predict one answer and the bottom neuron will predict its own answer. Importantly, one thing you should really notice is that each neuron has its own weights — each neuron has its own lines going into just that neuron — so they act independently, but they can communicate later if you add another layer. So let's start formalizing this a bit more and think about it more programmatically: what if we wanted to program this neural network ourselves, from scratch?
Remember the equation I showed you — it didn't sound very complex: you take a dot product, add a bias, which is a single number, and apply a nonlinearity. Let's see how we would actually implement something like that. To define the layer — we'll now call it a layer, which is a collection of neurons — we first have to define the weights and then how information propagates through it. We do that by creating a call function. First we define the weights for that layer: remember, every neuron has weights and a bias, so we define those first. Then we create the call function to actually see how we pass information through the layer. It takes the inputs —

which are what we earlier called x — and it's the same story we've been seeing throughout this class: we matrix-multiply, or take a dot product of, our inputs with our weights, add a bias, and then apply a nonlinearity. It's really that simple; we've now created a single layer of a neural network, and that nonlinearity line is exactly what allows it to be a powerful neural network. The important thing to keep in mind is that modern deep learning libraries and toolboxes already implement a lot of this for you. It's important to understand the fundamentals, but in practice all of that layer architecture and layer logic is implemented in tools like TensorFlow and PyTorch through a dense layer. Here you can see an example of how to create and initialize a dense layer with two neurons, allowing it to be fed an arbitrary set of inputs.
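As a sketch of what such a layer could look like if we wrote it ourselves, in the spirit of the slide (this is illustrative, not the exact lab code), along with the built-in one-line shortcut:

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    """A from-scratch dense layer: weights, bias, and a nonlinearity."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # Every neuron has its own weights and its own bias.
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        z = tf.matmul(inputs, self.W) + self.b   # dot product plus bias
        return tf.math.sigmoid(z)                # nonlinear activation

# In practice, the same thing is a single line with the built-in dense layer:
layer = tf.keras.layers.Dense(units=2, activation="sigmoid")
```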
Here we see those two neurons in a layer receiving three inputs, and in code it boils down to just that one line of TensorFlow, which makes it extremely easy and convenient for us to use these functions and call them. So now let's look at a single-layer neural network: this is where we now have one layer between our inputs and our outputs. We're slowly and progressively increasing the complexity of our neural network so we can build up all of these building blocks. The layer in the middle is called a hidden layer — obviously, because you don't observe it directly; you do observe the input and output layers. A hidden layer is just a layer of neurons that isn't directly observed; it gives your network more capacity, more learning complexity. And since we now have one transformation from inputs to hidden layer and another from hidden layer to outputs, we have a two-layered neural network, which means we also have two weight matrices: not only the W1 we had before to create the hidden layer, but also a W2 that performs the transformation from the hidden layer to the output layer. (Question: what nonlinearity happens at the hidden layer — or is it just linear there, so it isn't a perceptron?) Yes — every hidden layer also has a nonlinearity accompanying it, and that's a very important point, because without it you'd just have one very large linear function followed by a single nonlinearity at the very end. You need that cascaded, overlapping application of nonlinearities throughout the network. Okay, so now let's zoom in and look at a single unit in the hidden layer.

Take this one, for example — let's call it z2, the second neuron in the first layer. It's the same perceptron we saw before: we compute its answer by taking a dot product of its weights with its inputs, adding a bias, and then applying a nonlinearity. If we took a different hidden node, say z3, the one right below it, we would compute its answer in exactly the same way that we computed z2, except its weights would be different from z2's weights; everything else stays exactly the same, and it sees the same inputs. Of course, I'm not going to keep showing all of that in this picture, and the picture is getting a little messy, so let's clean things up a bit.

I'm going to remove all of the lines now and replace them with these symbols, these boxes. They denote what we call a fully connected layer: everything in the input is connected to everything in the output, and the transformation is exactly what we saw before — dot product, bias, and nonlinearity. Again, doing this in code is extremely simple with the foundation we've built since the beginning of class: we can just define two of these dense layers, our hidden layer on line one with n hidden units, and then our output layer with two output units. (Question: does that mean the nonlinearity function has to be the same between layers?)

The nonlinearity does not need to be the same in every layer. Often it is, for convenience; there are also cases where you'd want it to be different. In lecture two you'll see nonlinearities that differ even within the same layer, let alone across layers. But unless there's a particular reason, the general convention is that there's no need to make them different. Now let's keep expanding our knowledge a little more: we want to create a deep neural network, not just the neural network we saw on the previous slide.
Now it's deep. All that means is that we're going to stack these layers on top of one another, one after the other, creating a more and more hierarchical model, in which the final output is computed by going deeper and deeper into the neural network. Doing this in code follows exactly the same story as before: just cascade these TensorFlow layers on top of each other and go deeper into the network. Okay, this is great, because now we have at least a solid, basic understanding of how to define not only a single neuron but an entire neural network, and at this point you should be able to explain, or at least understand, how information goes from the input, through an entire neural network, to compute an output.
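As a quick sketch (with hypothetical layer sizes), cascading those dense layers into a deep model might look like this:

```python
import tensorflow as tf

# Hidden-layer sizes n1 and n2 and the two output units are arbitrary choices for illustration.
n1, n2 = 32, 32
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(n2, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(2)                        # output layer with two units
])
```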
So now let's see how we can apply these neural networks to solve a very real problem that I'm sure everyone is interested in. Here's the problem: we want to build an AI system that learns to answer the following question — will I pass this class? I'm sure all of you are really concerned about this question. To do this, let's start with a simple input feature model. The two features we'll work with are, number one, how many lectures you attend, and number two, how many hours you spend on your final project. Let's look at data from past years of this class — at where different people have lived in this space between how many lectures they attended and how much time they spent on their final project. Each dot is a person, and the color of the dot indicates whether they passed or failed the class, so you can see and visualize this feature space, if you like, that we talked about before. And then there's you: you land right here, at the point (4, 5) in this feature space.

You've attended four lectures and you're going to spend five hours on your final project, and you want to build a neural network that, given everyone else in the class it has seen from previous years, helps you figure out your probability of passing or failing this class. So let's do it — we now have all of the building blocks to solve this problem using a neural network. We have two inputs: the number of lectures you attend and the number of hours you spend on your final project, four and five. We pass those two inputs in as our two variables, x1 and x2; they're fed into a single-hidden-layer neural network with three hidden units in the middle, and we see that the final predicted probability of you passing this class is 0.1, or 10 percent. So the outlook is very bleak — not a good answer.

The actual probability is 1: you attended four out of five lectures and spent five hours on your final project; you actually live in a part of the feature space that was very positive — it looked like you were going to pass the class. So what happened here? Does anyone have any ideas why the neural network got this so horribly wrong? Exactly — it isn't trained. This neural network has not been trained; we haven't shown it any of that data, the green and red points. You should really think of neural networks as being like babies: before they see data, they haven't learned anything, and there's no expectation that they should be able to solve any of these kinds of problems before we teach them something about the world. So let's first teach this neural network something about the problem. To train it, we first need to tell the network when it's making bad decisions — we have to teach it, actually train it to learn, much like how we as humans learn. We have to inform the neural network when it gets the answer wrong so that it can learn how to get the answer right. The closer its answer is to the ground truth, the better — for example, the true value here was that you pass the class with probability 100 percent, but the network predicted a probability of 0.1.
We compute what's called the loss: the closer these two things are, the smaller your loss should be, and the more accurate your model. Now, say we have data not just from one student but from many students who have taken this class before; we can plug them all into the neural network and show them to this system. Then we care not only about how the neural network did on that one prediction, but about how it predicts across all of the different people it has been shown during this training and learning process. When we train the neural network, we want to find a network that minimizes the empirical loss between our predictions and those ground-truth outputs, averaged across all of the different inputs the model has seen. If we look at this binary classification problem — yes or no, will I pass the class or will I fail —

the output is a zero-or-one probability, and we can use what's called the softmax cross-entropy function to tell the network whether it's getting the answer right or wrong. Think of the softmax cross-entropy as an objective function, a loss function that tells our neural network how far apart these two probability distributions are. The output is a probability distribution, and we're trying to determine how bad the answer the neural network predicted is, so that we can give it feedback and it can produce a better answer.
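As a sketch, computing a cross-entropy loss between the true labels and the network's predicted probabilities might look like this (here using TensorFlow's binary cross-entropy for the yes/no case; the labels and predictions are made up):

```python
import tensorflow as tf

y_true = tf.constant([1.0, 0.0, 1.0])        # ground truth: passed, failed, passed (hypothetical)
y_pred = tf.constant([0.1, 0.3, 0.8])        # the network's predicted probabilities of passing

bce = tf.keras.losses.BinaryCrossentropy()   # cross-entropy loss for a 0/1 output
loss = bce(y_true, y_pred)                   # averaged over all the examples
```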
Now suppose that instead of predicting a binary output, we want to predict a real-valued output — any number, which can take on more or less any value — for example, the grade you get in a class, which doesn't necessarily have to be between 0 and 1, or between 0 and 100. Here you would use a different loss to produce that value, because the outputs are no longer a probability distribution. For example, you could compute a mean squared error loss between the true value — your true class grade — and the predicted grade. These are two numbers, not necessarily probabilities: you compute their difference and square it, to get a distance between the two, an absolute distance where the sign doesn't matter, and then you can minimize that.
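A corresponding sketch for the real-valued case, again with made-up numbers:

```python
import tensorflow as tf

grades_true = tf.constant([92.0, 75.0, 88.0])    # true class grades (hypothetical)
grades_pred = tf.constant([85.0, 70.0, 90.0])    # predicted grades

mse = tf.keras.losses.MeanSquaredError()
loss = mse(grades_true, grades_pred)             # mean of the squared differences
```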
Okay, so let's put all of this loss information together with the problem of finding our network, into one unified picture of how to train our neural network. We know we want to find a neural network that solves this problem on all of this data, on average — that's how we contextualized the problem earlier in the lecture. This effectively means we're trying to solve for, to find, the weights of our neural network: that big vector W we talked about earlier. What is this vector W? Compute this vector W for me based on all of the data we've seen. The vector W also determines the loss: given a single vector W, we can compute how badly this neural network is performing on our data — what is the loss, what is the deviation of our network from the ground truth, from where it should be? Remember that W is just a bunch of numbers: a very big list of numbers, a list of weights for every layer and every neuron in our neural network. It's just a very big list, or vector, of weights that we want to find — what is that vector, given a large amount of data?

That is the problem of training a neural network. And remember that our loss function is just a function of our weights. If we had only two weights in our neural network, as in the earlier slide, we could plot the loss landscape over that two-dimensional space: we have two weights, w1 and w2, and for every configuration, every setting, of those two weights, our loss has a particular value, shown here as the height of this plot. For any w1 and w2, what is the loss? What we want to do is find the lowest point: what is the best loss, and what are the weights such that our loss is as low as possible? The lower the loss the better, so we want to find the lowest point on this plot. Now, how do we do that?

The way this works is that we start somewhere in this space — we don't know where to start, so we pick a random place to start. From there, we compute what's called the gradient of the landscape at that particular point. This is a very local estimate of which way is up — where the slope is increasing at my current location. That tells us not only which way is up but, more importantly, which way is down: if I negate that direction, if I go in the opposite direction, I can go down the landscape and change my weights so that my loss decreases. So let's take a small step — just a small step in the direction opposite to the way up, a small step downhill — and keep repeating this process: we compute a new gradient at the new point, take another small step, and keep doing this over and over again until we converge at what's called a local minimum. Depending on where we start, it may not be the global minimum of this loss landscape, but by following this very simple algorithm we are guaranteed to converge to a local minimum. Let's now summarize this algorithm, which is called gradient descent — first in pseudocode, and then we'll see it in real code in a second. There are a few steps: we initialize our position randomly somewhere in this weight space; we compute the gradient of our loss with respect to our weights; and then we take a small step in the opposite direction, and keep repeating this in a loop, again and again, until convergence — until we basically stop moving and the network finds where it's supposed to end up. We'll talk more about this small step — the fact that we multiply our gradient by what I keep calling a small step — in the later part of this lecture. For now, the analogous code mirrors the pseudocode very closely: we initialize our weights randomly (this happens every time you train a neural network — you have to randomly initialize the weights), and then we have a loop, shown here without the convergence condition, where we keep looping: we compute the loss and the gradient at that location — which way is up — then we negate that gradient, multiply it by what's called the learning rate, LR, denoted here as that small step, and take a step in that direction.
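Here is a minimal sketch of that loop in TensorFlow, with a tiny stand-in model and made-up data (the real training loop in the labs will differ in its details):

```python
import tensorflow as tf

x = tf.constant([[4.0, 5.0]])                        # e.g. lectures attended, hours on project
y = tf.constant([[1.0]])                             # ground truth: passed the class

model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation="relu"),     # hidden layer with three units
    tf.keras.layers.Dense(1, activation="sigmoid")   # predicted probability of passing
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
lr = 1e-2                                            # learning rate: the "small step"

for step in range(1000):                             # in practice, loop until convergence
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))                  # forward pass and loss
    grads = tape.gradient(loss, model.trainable_variables)
    for w, g in zip(model.trainable_variables, grads):
        w.assign_sub(lr * g)                         # step opposite the gradient: downhill
```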
Let's take a deeper look at that gradient term. It tells us which way is up in the landscape, and even more than that, it tells us how our loss is changing as a function of all of our weights. But I haven't actually told you how to compute it, so let's talk about that process, which is called backpropagation. We'll go through it very, very briefly, starting with the simplest neural network possible. We've already seen the simplest building block, a single neuron; now let's build the simplest neural network, which is a one-neuron network: one hidden neuron, going from input to hidden neuron to output. We want to compute the gradient of our loss with respect to this weight, w2 — I'm highlighting it here. We have two weights, so let's first compute the gradient with respect to w2; that tells us how much a small change in w2 affects our loss — does our loss increase or decrease if we move w2 a little in one direction or the other? So let's write out this derivative.

We can start by applying the chain rule backwards from the loss through the output. Specifically, we can decompose this gradient into two parts: we decompose dJ/dw2 into dJ/dy — where y is our output — multiplied by dy/dw2. All of this is possible by the chain rule — it's just me reciting the chain rule from calculus — and it works because y depends only on the previous layer. Now suppose we want to do this not for w2 but for w1: we can use exactly the same process, except it's one step further — now we replace w2 with w1.

We need to apply the chain rule one more time to decompose the problem further, propagating the old gradient we computed for w2 one step further back, to the weight we're interested in, which in this case is w1. We keep repeating this process over and over, propagating these gradients backwards from the output towards the input, to ultimately compute what we actually want: the derivative of the loss with respect to every weight in the neural network, which tells us how much a small change in each weight of the network affects the loss.
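Written out with notation along the lines of the slides (J(W) for the loss, ŷ for the output, z1 for the hidden unit), the two decompositions described above are:

$$
\frac{\partial J(W)}{\partial w_2} \;=\; \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2},
\qquad
\frac{\partial J(W)}{\partial w_1} \;=\; \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}.
$$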
In other words: does our loss increase or decrease if we shift this weight a little in this direction or a little in that direction? (Question: you've used the terms neuron and perceptron — is there a functional difference?) Neuron and perceptron are the same thing. People usually say neural network, so "single neuron" has also gained popularity, but originally perceptron is the formal term; the two terms are identical. Okay, we've now covered a lot: we've covered the forward propagation of information through a neuron and through an entire neural network, and now we've covered the backward propagation of information to understand how we should change each of the weights in our network to improve our loss. That was the backpropagation algorithm. In theory it's actually quite simple — it's just the chain rule; there's really nothing more to it than the chain rule — and the nice part is that deep learning libraries actually do this for you: they compute the backprop for you, so you don't have to implement it yourself, which is very convenient. But even though the theory is not that complicated,

backpropagation in practice is a different matter. Thinking now a little bit about your own implementations: when you want to implement these neural networks, what are some of the insights? Optimizing neural networks in practice is a completely different story — it's not simple at all, and in practice it's very difficult and usually very computationally intensive to run this backpropagation algorithm. Here is an illustration from a paper that came out a few years ago that tried to visualize the loss landscape of a very deep neural network. Earlier we had that other visualization of a loss landscape in two dimensions; real neural networks are not two-dimensional — they have hundreds, millions, or billions of dimensions — so what do those loss landscapes look like?

You can try some clever techniques to actually visualize them — this is a paper that tried to do exactly that — and it turns out they look extremely messy. The important point is that if you run this algorithm and you start in a bad place, then, depending on your neural network, you may not actually end up at the global solution. So your initialization is very important, and you need to get past these local minima and try to find the global minimum — or, even more than that, you want to build neural networks whose loss landscapes are much more amenable to optimization than this one, because this is a very bad loss landscape.

There are some techniques we can apply to our neural networks that smooth out their loss landscapes and make them easier to optimize. Remember the update equation we talked about earlier with gradient descent: there's this parameter we didn't describe, the one I kept calling the small step you take. It's a small number multiplied with the direction given by your gradient; it says, I'm not going to go all the way in this direction, I'm just going to take a small step that way. In practice, even setting this value — it's just a number — can be quite difficult.
If we set the learning rate too small, then the model can get stuck in local minima — here it starts and gets stuck in a local minimum, converging very slowly even if it doesn't get stuck entirely. If the learning rate is too large, it can overshoot and, in practice, even diverge and explode, never actually finding a minimum. Ideally, we want learning rates that are neither too small nor too large: large enough to basically skip over those local minima, but small enough that they won't diverge and will still find their way toward the global minimum. So something like this is what you should intuitively have in mind: something that can overshoot the local minima

but lands in a better minimum and eventually settles there. So how do we actually set these learning rates in practice? What does that process look like? Idea number one is very basic: try a bunch of different learning rates and see what works. That's actually not a bad process in practice — it's one that people really use — but let's see if we can do something smarter and design algorithms that can adapt to the landscape.

In practice, there's no reason this should be a single fixed number: can we have learning rates that adapt to the model, to the data, to the landscape, to the gradients it's seeing around it? That means the learning rate may actually increase or decrease as a function of the gradients in the loss function, of how fast we're learning, or of many other options — there are many different ideas that could be applied here. In fact, there are many widely used procedures, or methodologies, for adapting the learning rate, and during your labs we encourage you to try out some of these different ideas and really play with them yourself to see the effect of increasing or decreasing your learning rate; you'll see very striking differences. (Question: since it's in a closed interval, why not just find the absolute minimum directly?)

So, a few things. Number one: it's not a closed space — each weight can range from minus infinity to plus infinity. So even for a one-dimensional neural network with a single weight, it's not a closed space, and in practice it's even worse than that, because you have billions of dimensions: your search space is not just infinite along one dimension, you have billions of infinite dimensions. It's not something where you can just enumerate every possible weight configuration the neural network could take and try them all — that isn't practical to do even for a very small neural network. So, in your labs you'll get to put all of this into practice: in this picture, you define your model, number one, right here; and your optimizer, which we previously described as this gradient descent optimizer — here we call it SGD, stochastic gradient descent, and we'll talk about that more in a second. Note that your optimizer, which we call SGD here, could be swapped with any of these adaptive optimizers. You can, and should, swap them: try different ones to see the impact of these different methods on your training procedure, and you'll build very valuable intuition for the insights that come with that.
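As a sketch of what that swap looks like in TensorFlow (the learning rates here are arbitrary):

```python
import tensorflow as tf

# Plain (stochastic) gradient descent: the fixed-learning-rate optimizer described above.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2)

# Adaptive optimizers adjust their effective step sizes during training;
# any of them can be swapped in for SGD to see the effect on training.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
optimizer = tf.keras.optimizers.Adagrad(learning_rate=1e-2)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)
```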
Also, I want to very briefly continue toward the end of this lecture and talk about tips for training these neural networks in practice, focusing on a powerful idea: batching your data rather than looking at all of it at once — a topic called batching. To set this up, let's very briefly revisit the gradient descent algorithm. The gradient computation — this backpropagation algorithm I mentioned — is a very computationally expensive operation, and it's even worse because, the way we described it above, we would have to compute it as a sum over every single data point in our entire dataset.

That's how we defined it with the loss function: it's an average over all of our data points, which means we're summing the gradients over all of our data. In most real-life problems this would be completely infeasible, because the datasets are simply too big and the models too large to compute those gradients on every iteration — and remember, this isn't a one-time thing: it's every step you take; you keep taking small steps, so you need to repeat this process over and over. So instead, let's define a new gradient descent algorithm, called SGD — stochastic gradient descent.

Instead of computing the gradient over the entire dataset, we pick a single training point and compute the gradient on that one point. The nice thing is that this gradient is much faster to compute — it's just one point — and the downside is that it's very noisy, very stochastic, since it was computed from just that one example, so you have that variance to deal with. What's the middle ground? The middle ground is to take not one data point and not the whole dataset, but a batch of data — what's called a mini-batch. In practice this could be something like 32 data points, a common batch size, and it gives us an estimate of the true gradient: we approximate the gradient by averaging the gradients of those 32 samples. It's still fast, because 32 is much smaller than the whole dataset, and still noisy, but it's usually fine in practice because you can iterate much faster. And since B is usually not that large —

think on the order of tens or a hundred samples — it's very fast to compute in practice compared to regular gradient descent, and also much more accurate compared to stochastic gradient descent. The increased accuracy of this gradient estimate lets us converge to a solution significantly faster: it's not only the speed of each step but also the accuracy of the gradients that lets us reach the solution faster, which ultimately means we can train much faster and save compute. The other really nice thing about mini-batches is that they allow us to parallelize our computation — a concept we talked about earlier in the class, and here's where it comes into play: we can split those batches up, say those 32 pieces of data if our batch size is 32, across different workers; different parts of the GPU can handle different parts of our data, which can give us even more significant speedups using GPUs and GPU hardware.
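A sketch of how mini-batching might look with TensorFlow's tf.data pipeline; the dataset here is a random stand-in, and 32 is the common batch size mentioned above:

```python
import tensorflow as tf

# Stand-in dataset: 1000 random examples with 2 features and a binary label.
features = tf.random.normal((1000, 2))
labels = tf.cast(tf.random.uniform((1000, 1)) > 0.5, tf.float32)

# Shuffle the data and split it into mini-batches of 32 examples each.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).shuffle(1000).batch(32)

for x_batch, y_batch in dataset:
    # Each iteration would compute gradients on just this mini-batch (as in the loop sketched
    # earlier), giving a fast yet reasonably accurate estimate of the true gradient.
    pass
```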
Okay, finally, the last topic I want to talk about before we end this lecture and move on to lecture number two is overfitting. Overfitting is an idea that is not specific to deep learning at all; it is a problem that exists in all of machine learning. The key question it addresses is how you can tell whether your model is actually capturing the true structure of your data, or whether it is just memorizing subtle details that happen to correlate with your training set. Put another way, let me say it a little differently: we want to build models that learn representations from our training data that still generalize to new, unseen test points. That is the real goal here.
We want to teach our model something from a lot of training data, but we don't just want it to perform well on that training data; we want it to perform well when we deploy it in the real world and it sees things it has never seen during training. The concept of overfitting addresses exactly that problem. Overfitting means that your model is performing very well on your training data but very poorly when it is tested on held-out data: you are overfitting to the training data you saw. On the other hand, there is also underfitting, shown on the left side of the figure.
When you are underfitting the data, you will achieve very similar performance on your training and test distributions, but both will fall short of what your system is actually capable of. You want to end up somewhere in the middle: a model that is not so complex that it memorizes all the nuances of your training data, like on the right, but that still performs well on new data, so that we are not underfitting either. To really address this problem, in neural networks and in machine learning in general, there are a few different techniques you need to know, because you will need to apply them as part of your solutions in the software labs as well. The key concept here is called regularization.
Regularization is a technique you can introduce into training, and, simply put, it is a way to discourage your model from latching onto those nuances of your training data. As we have seen before, it is essential that our models can generalize, not just on the training data, but on the test data, which is what we really care about. The most popular regularization technique that is important for you to understand is a very simple idea called dropout. Let's look again at this picture of a deep neural network that we have been seeing throughout the lectures.
With dropout, during training we randomly set some of the activations, the outputs of individual neurons, to zero with some probability. Let's say 50% is our probability. That means we take every activation in our neural network and, with 50% probability, before passing that activation on to the next neuron, we simply set it to zero and transmit nothing. Effectively, 50% of the neurons are turned off, or killed, on that forward pass, and you only pass information forward through the other 50% of your neurons. This idea is actually extremely powerful because it reduces the capacity of our neural network, and not only does it reduce the capacity, it reduces it dynamically, because on the next iteration we will choose a different 50% of neurons to drop. The network is constantly forced to learn to build different pathways from input to output, and it cannot rely too heavily on any small set of features present in any one part of the training data set, because it is constantly forced to find these different paths under random dropout.
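As a small illustration of how simple this is to use in practice, here is a hedged sketch of adding dropout to a network, again assuming a TensorFlow/Keras style setup (the layer sizes are placeholders; the 0.5 rate just mirrors the 50% example above):

```python
import tensorflow as tf

# Each Dropout layer zeroes each activation from the preceding layer
# with probability 0.5 during training, and is a no-op at test time.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),
])
```

Keras uses inverted dropout, so the surviving activations are scaled up by 1/(1 - rate) during training, which is why nothing needs to change at test time.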
The second regularization technique is a notion called early stopping, which is actually model independent: you can apply it to any type of model as long as you have a test set you can hold out. The idea is that we already have a fairly operational definition of what overfitting means: it is simply when our model starts performing worse on our test set. So let's plot the course of training, with the x-axis being training time, and look at the loss on both the training set and the test set. At first you can see that both the training loss and the test loss go down, and keep going down, which is great because it means our model is getting stronger. Eventually, though, you will notice that the test loss plateaus and then starts to increase. The training loss, on the other hand, has no reason to stop going down; training loss generally keeps decaying as long as the neural network has the capacity to keep fitting those details, and that continues for the rest of training. The point we really care about is right where the test loss bottoms out, because that is exactly where we should stop training. After this point we start overfitting to parts of the data: our training performance keeps improving while our test performance gets worse, and that growing gap is what overfitting means. On the left side of that point we have the opposite problem: we have not fully used the capacity of our model, and the test performance can still improve further. This is a very powerful idea, and it is extremely easy to implement in practice, because all you really have to do is monitor the test loss over the course of training and keep the model from the point right before it starts to get worse.
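A minimal sketch of that monitoring loop might look like the following. The train_one_epoch and evaluate_loss functions are hypothetical placeholders standing in for whatever training and evaluation code you already have, model is assumed to be a Keras-style model, and the patience value is just an illustrative choice:

```python
best_loss = float("inf")
best_weights = None
patience, epochs_without_improvement = 5, 0   # illustrative patience value
max_epochs = 100

for epoch in range(max_epochs):
    train_one_epoch(model, train_set)           # hypothetical helper: one pass over the training data
    test_loss = evaluate_loss(model, test_set)  # hypothetical helper: loss on held-out data

    if test_loss < best_loss:                   # still improving: remember this model
        best_loss = test_loss
        best_weights = model.get_weights()      # assumes a Keras-style model
        epochs_without_improvement = 0
    else:                                       # test loss got worse
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                               # stop: we are past the sweet spot

model.set_weights(best_weights)                 # roll back to the best checkpoint
```

If you are using Keras, the built-in callback tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True) packages this same logic for you.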
I'm going to conclude this lecture by simply summarizing the three key points we've covered in the class so far. This is a very fast-paced class, the whole week is going to be like this, and today is just the beginning. So far we have learned the fundamental building blocks of neural networks, starting from a single neuron, also called a perceptron; we learned that we can stack these units on top of each other to create hierarchical networks, and how we can mathematically optimize those kinds of systems; and finally, in the last part of the class, we talked about tips and techniques for training and applying these systems in practice. In the next lecture we're going to hear from Ava about deep sequence modeling using RNNs, and also about a really new and exciting type of model called the Transformer, which is built on the principle of attention. You'll learn about that in the next lecture, but for now let's take a quick break and come back in about five minutes so we can switch speakers and Ava can start her presentation. Thank you.
