
How Deep Neural Networks Work - Full Course for Beginners

May 30, 2021
Neural networks are good at learning many different types of patterns. To give an example of how this would work, imagine you had a four-pixel camera, so not four megapixels, just four pixels, and it was just black and white. You wanted to go around taking pictures of things and automatically determine whether each picture was a completely white or completely dark solid image, a vertical line, a diagonal line, or a horizontal line. This is tricky because you can't do it with simple rules about the brightness of the pixels. Two images may both be horizontal lines, but if you try to make a rule about which pixel is bright and which is dark, you won't be able to do it.

To do this with a neural network, you start by taking all of your inputs, in this case the four pixels, and dividing them up into input neurons, assigning a number to each of them depending on the brightness or darkness of its pixel. Plus one is completely white, minus one is completely black, and gray is zero, right in the middle. These values, once you've split them up and enumerated them across the input neurons, are also called an input vector or matrix: just a list of numbers that represents your inputs. Right now, it's a useful notion to think about the receptive field of a neuron. All this means is: what set of inputs makes the value of this neuron as high as possible? For the input neurons this is pretty easy. Each is associated with a single pixel, and when that pixel is all the way white, that input neuron's value is as high as possible. The black-and-white checkered areas show pixels that an input neuron doesn't care about: whether they are completely white or completely black, it still doesn't matter.
It doesn't affect the value of that input neuron at all. Now, to build a neural network, we create a neuron. The first thing it does is add up all the values of the input neurons, so in this case, if we add up all those values, we get 0.5. Now, to complicate things a little, each of the connections is weighted, which means it is multiplied by a number. That number can be one, or minus one, or any value in between. So, for example, if something has a weight of minus one, it is multiplied by minus one and you get the negative, and that's what gets added. If something has a weight of zero, it's effectively ignored. This is what those weighted connections might look like, and you'll notice that after the input neuron values are weighted and added, the final value can be completely different.
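The weighted sum described above can be sketched in a few lines. This is a minimal illustration, not code from the video; the pixel values and weights below are made up.

```python
# A single neuron's weighted sum over a four-pixel input.
# Pixel values: +1 is completely white, -1 is completely black, 0 is gray.
pixels = [0.5, -0.25, 1.0, -0.75]   # hypothetical four-pixel input vector
weights = [1.0, -1.0, 0.0, 0.5]     # one weight per connection

# Multiply each input by its weight, then add everything up.
weighted_sum = sum(p * w for p, w in zip(pixels, weights))
print(weighted_sum)  # 0.5*1 + (-0.25)*(-1) + 1*0 + (-0.75)*0.5 = 0.375
```

Note that a weight of zero makes its pixel drop out of the sum entirely, which is exactly the "effectively ignored" behavior described above.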
Graphically, it is convenient to represent these weights as links: white links have positive weights, black links have negative weights, and the thickness of a line is roughly proportional to the magnitude of its weight. Then, after adding the weighted input neurons, the result gets squished, and I'll show you what that means. You have a sigmoid squashing function. Sigmoid simply means s-shaped. What this does is take a value, say 0.5, pass a vertical line up to your sigmoid curve, then a horizontal line from where it intersects over to the y axis, and where it hits the y axis is the output of your function. In this case, the output is a little less than 0.5. As your input number increases, your output number also increases, but more slowly, and eventually, no matter how big the number you put in, the answer is always less than one.
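The squashing function described above can be sketched directly. The video doesn't name an exact formula, only that the output is s-shaped and stays between minus one and plus one; `math.tanh` has exactly that property, so it stands in here as an assumption.

```python
import math

# An s-shaped squashing function that maps any input into (-1, 1),
# matching the behavior described above (tanh is a stand-in choice).
def squash(x):
    return math.tanh(x)

print(squash(0.5))    # a little less than 0.5, as described
print(squash(1000))   # pinned near 1 no matter how large the input
print(squash(-1000))  # pinned near -1 for very negative inputs
```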
Similarly, when the input is very negative, the answer is always greater than negative one. This ensures that the value of the neuron never goes outside the range of plus one to minus one, which is useful for keeping the computations in the neural network bounded and stable. So, after adding the weighted values of the input neurons and squashing the result, you get the output, in this case 0.746. That is a neuron. We can collapse all of that down: a neuron does a weighted sum and squashes the result. Now, instead of just one of those, say you have a bunch. Four are shown here, but there could be 400 or 4 million. To keep our picture clear, we'll assume for now that the weights are either plus one (a white line), minus one (a black line), or zero, in which case they are missing entirely. But in reality, all of these neurons we created are connected to all of the input neurons, and every connection has some weight between minus one and plus one.

When we create this first layer of our neural network, the receptive fields become more complex. For example, each neuron here ends up combining two of our input neurons, and so the receptive field, the set of pixel values that makes the value of a first-layer neuron as large as possible, now looks like a pair of pixels, either both white or a mix of black and white, depending on the weights. So, for example, this neuron is attached to the top-left input pixel and the bottom-left input pixel, and both weights are positive, so it combines the two: its receptive field is this one's receptive field plus this one's receptive field. However, this other neuron combines the top-right pixel and the bottom-right pixel, with a weight of minus one on the bottom-right pixel, which means it is most active when that pixel is black. That is its receptive field.

Now, because we were careful when creating that first layer, its outputs look a lot like input values, and we can go ahead and create another layer on top of it in exactly the same way, with the output of one layer being the input to the next layer. We can repeat this for three or seven or 700 additional layers.
Each time, the receptive fields become even more complex. You can see here, using the same logic, that they now cover all the pixels, with more specific arrangements of which are black and which are white. We can create another layer in the same way. All of the neurons in one layer are connected to all of the neurons in the previous layer, but here we assume that most of those weights are zero and leave them out of the picture; that is usually not the case. Just to mix things up, we will create this new layer differently: our squashing function is gone, and we have something new called a rectified linear unit. This is another popular type of neuron. You do the weighted sum of all your inputs, and instead of squashing, you rectify: if the result is negative, you make the value zero; if it is positive, you keep the value. Obviously this is very easy to compute, and it turns out it also has very good stability properties for neural networks in practice. After doing this, some of our weights are positive and some are negative, so when they are connected to those rectified linear units, we get receptive fields that come in opposite pairs, if you look at the patterns. Finally, when we have created as many layers with as many neurons as we want, we create an output layer. Here we have the four outputs we are interested in: is the image solid, vertical, diagonal, or horizontal?
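The rectified linear unit described above is one line of code, a minimal sketch:

```python
# Rectified linear unit (ReLU): negative inputs become zero,
# positive inputs pass through unchanged.
def relu(x):
    return max(0.0, x)

print(relu(-2.5))  # 0.0
print(relu(1.7))   # 1.7
```

Compared with the sigmoid, there is no exponential to evaluate, which is part of why it is so cheap to compute.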
To see an example of how this would work, let's say we start with the input image shown on the left: dark pixels on top, white pixels on the bottom. As we propagate that to our input layer, these are the values you would see: the top pixels dark, the bottom pixels light. As we move to our first layer, one neuron combines a dark pixel and a light pixel, which sum together to give us zero, gray. Down here we have the combination of a dark pixel and a light pixel with a negative weight, which gives us a negative value. This makes sense, because if we look at the receptive field here (white top-left pixel, black bottom-left pixel), it is exactly the opposite of the input we are getting, so we would expect its value to be as low as possible: minus one.

As we move to the next layer, we see the same kinds of things: combining zeros to get zeros; combining a negative and a negative with a negative weight, which makes a positive, to get a zero; and here, combining two negatives to get a negative. Again you'll notice that the receptive field of this neuron is exactly the inverse of our input, so it makes sense that its value is negative. On we go to the next layer. All of these zeros, of course, propagate forward. This one is negative; it has a negative value and a positive weight, so the negative value just moves forward. But because we have a rectified linear unit, negative values become zero, so it becomes zero again too. This other one, though, is rectified from a negative value through a negative weight, and a negative times a negative is positive. When we finally get to the output, we can see that the outputs are all zero except this horizontal one, which is positive. And that's the answer: our neural network says this is an image of a horizontal line.
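The forward pass walked through above can be sketched as a toy computation. The weights below are made up for illustration, not the ones from the video; the image is the same dark-on-top, light-on-bottom horizontal line.

```python
# A toy forward pass through one layer of a four-pixel network.
def relu(v):
    return [max(0.0, x) for x in v]

def layer(inputs, weights):
    # weights[j][i] connects input pixel i to output neuron j
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

# four pixels: top row dark (-1), bottom row light (+1)
image = [-1.0, -1.0, 1.0, 1.0]

w1 = [[0.5, 0.0, 0.5, 0.0],    # a top and a bottom pixel cancel to zero
      [0.0, -0.5, 0.0, 0.5]]   # negative weight flips the dark pixel
hidden = relu(layer(image, w1))
print(hidden)  # [0.0, 1.0]
```

The first neuron's receptive field is the opposite of nothing in particular, so its contributions cancel; the second matches the input pattern and fires strongly, which is the same story the walkthrough tells.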
Now, neural networks are generally not that good or that clean, so there is the notion of truth: with a given input, what is the right answer? In this case, the truth is a zero for all of these outputs and a one for horizontal: it is not solid, it is not vertical, it is not diagonal, it is horizontal. An arbitrary neural network will give answers that are not exactly the truth. It may be off by a little or a lot, and the error is the magnitude of the difference between the truth and the answer given. You can add all of this up to get the total error of the neural network. So the general idea with learning and training is to adjust the weights so that the error is as low as possible.

The way you do this is to put in an image, calculate the error at the end, then look at how to adjust those weights, higher or lower, to make that error go up or down, and of course we adjust the weights in the direction that makes the error go down. Now, the problem with doing this is that every time we go back and recalculate the error, we have to multiply all those weights by all the values of the neurons in each layer, and we have to do it over and over again, once for each weight. This takes forever in computational terms at computer scale, so it's not a practical way to train a large neural network. You can imagine that, instead of just rolling to the bottom of a simple valley, we have a very high-dimensional valley and we have to find our way down, and because there are so many dimensions, one for each of these weights, the calculation becomes prohibitively expensive.

Fortunately, there was an insight that allows us to do this in a very reasonable time, and it is this: if we are careful about how we design our neural network, we can calculate the slope directly, the gradient. We can determine the direction we need to adjust each weight without going all the way back through our neural network and recalculating. Just to review, the slope we are talking about is this: when we make a change in a weight, the error will change a little, and that ratio of the change in error to the change in weight is the slope.
Mathematically, there are several ways to write this; we will prefer the one at the bottom, which is technically more correct. We'll call it de/dw for short. Every time you see it, just think "the change in error when I change a weight", or the change in the top when I change the bottom. This goes into a bit of calculus: we take derivatives, which is how we calculate the slope. If this is new to you, I highly recommend a good semester of calculus, just because the concepts are very universal and many of them have very nice physical interpretations, which I find very appealing. But otherwise, don't worry, just ignore this and pay attention to the rest, and you'll get a general idea of how it works.

In this case, if we change the weight by plus one, the error changes by negative two, which gives us a slope of minus two. That tells us the direction we should adjust our weight, and how much we should adjust it, to reduce the error. Now, to do this, you need to know what your error function is. So let's say we have an error function that is the square of the weight, and our weight is sitting right at minus one. The first thing we do is take the derivative, the change in error divided by the change in weight, de/dw. The derivative of the weight squared is twice the weight. Then we plug in our weight of negative one, and we get a slope de/dw of negative two.
Now, the other trick that allows us to do this with deep neural networks is chaining. To show you how it works, imagine a trivially simple neural network with just one input layer, one hidden layer, one output layer, and one weight connecting each of them. It is obvious that the value y is just the value x multiplied by the weight connecting them: y = x · w1. So if we change w1 a little, we simply take the derivative of y with respect to w1 and get dy/dw1 = x. Similarly, the error e is y times the next weight, e = y · w2, so the slope de/dy is just w2.

Because this network is so simple, we can calculate from one end to the other: x times w1 times w2 is the error e. So if we want to know how much the error will change when we change w1, we just take the derivative of that with respect to w1 and get x times w2. This illustrates the key point: what we just calculated is actually the product of our two derivatives, dy/dw1 multiplied by the derivative of the next step, de/dy. This is chaining: you can calculate the slope of each small step and then multiply them all together to get the slope of the entire chain, the derivative of the full chain. So, in a deeper neural network, what this looks like is: if I want to know how much the error will change when I adjust a weight that is deep in the network, I simply calculate the derivative of each small step back to the weight I'm trying to adjust, and then multiply them all together.
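The chaining example above can be checked numerically. The values here are made up for illustration; the point is that multiplying the small slopes gives the same answer as differentiating the whole chain e = x · w1 · w2 at once.

```python
# Chain rule on the trivial network: y = x * w1, e = y * w2.
x, w1, w2 = 2.0, 0.5, -3.0

y = x * w1            # forward through the first connection
e = y * w2            # forward through the second connection

dy_dw1 = x            # slope of y with respect to w1
de_dy = w2            # slope of e with respect to y
de_dw1 = dy_dw1 * de_dy   # chaining: multiply the small slopes

print(de_dw1)         # x * w2 = -6.0
```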
This is many times cheaper than what we had to do before, recalculating the error of the entire neural network for each weight. Now, in the neural network we have created, there are several kinds of operations, and to backpropagate we have to be able to calculate the slope through each of them. The first is just a weighted connection between two neurons, a and b. Suppose we know the change in error with respect to b, and we want to know the change in error with respect to a. To get there, we need db/da. So we write the relation between b and a, take the derivative of b with respect to a, and get the weight w. Now we know how to take that step; we know how to do that little nugget of backpropagation.

Another element we have seen is the sum: all of our neurons sum up many inputs. To take this backpropagation step, we do the same thing: write out the expression and take the derivative of our endpoint z with respect to the element a that we are propagating back to. In this case, dz/da is just one, which makes sense: if we have a sum of a bunch of elements and we increase one of those elements by one, we expect the sum to increase by one. That is the definition of a slope of one, a one-to-one relationship.

Another element we need to be able to propagate backward through is the sigmoid function. This one is a little more interesting mathematically; we'll just write it in shorthand as the sigma function. It is entirely feasible to go ahead and take the derivative analytically and calculate it, and it turns out this function has a nice property: to get its derivative, you simply multiply it by one minus itself, σ(1 − σ), so it is very simple to calculate.

The last element we have used is the rectified linear unit. Again, to figure out how to propagate this backward, we simply write the relation: b equals a if a is positive, otherwise b is zero. We take the derivative piecewise for each case, so db/da is one if a is positive, and zero otherwise.
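Each of the backpropagation nuggets above is a tiny function. This is a sketch; the function names are mine, not from the video, and the sigmoid here is the standard logistic form, for which the σ(1 − σ) derivative rule holds exactly.

```python
import math

# 1. Weighted connection b = w * a  →  db/da = w
def connection_slope(w):
    return w

# 2. Sum z = a + (other inputs)  →  dz/da = 1
def sum_slope():
    return 1.0

# 3. Sigmoid: the derivative is sigma * (1 - sigma)
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# 4. ReLU: slope is 1 where the input is positive, 0 otherwise
def relu_slope(a):
    return 1.0 if a > 0 else 0.0

print(sigmoid_slope(0.0))  # 0.25, the steepest point of the sigmoid
```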
With all these small backpropagation steps and the ability to chain them together, we can calculate the effect of adjusting any given weight on the error for any given input. To train, we start with a fully connected network. We don't know what any of its weights should be, so we assign them all random values, creating a completely arbitrary random neural network. We put in an input whose answer we know: we know whether it is solid, vertical, diagonal, or horizontal, so we know what the truth should be, and from that we can calculate the error. Then we run the input through, calculate the error, and use backpropagation to adjust all of those weights a little bit in the right direction. Then we do it again with another input, and again with another, many thousands or even millions of times. Eventually, all of those weights gravitate, rolling down that many-dimensional valley to a nice low spot at the bottom, where the network performs very well and is pretty close to the truth for most of the images. If we're very lucky, the result will look like what we started with: intuitively understandable receptive fields for those neurons and a relatively sparse representation, meaning most of the weights are small or close to zero. It won't always turn out that way, but what we are guaranteed is that the network will find a pretty good representation, the best it can do by adjusting those weights to get as close as possible to the correct answer for all the inputs.
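The training loop described above can be sketched end to end. To keep it short, this swaps in a toy one-weight "network" y = w · x with squared error, rather than the four-pixel classifier; all names and values are illustrative.

```python
import random

random.seed(0)
w = random.uniform(-1.0, 1.0)   # start with a random weight
target_w = 0.75                 # the "truth" is generated by this weight
learning_rate = 0.1

for step in range(1000):
    x = random.uniform(-1.0, 1.0)   # an input whose answer we know
    truth = target_w * x
    y = w * x                       # run it through the network
    de_dw = 2 * (y - truth) * x     # slope via the chain rule
    w -= learning_rate * de_dw      # nudge the weight downhill

print(round(w, 3))  # very close to 0.75 after many small adjustments
```

Each pass nudges the weight a little in the direction that lowers the error, and after enough passes the weight has rolled down to the bottom of its (one-dimensional) valley.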
What we have covered is just a very basic introduction to the principles behind neural networks. I haven't told you enough to be able to go out and build your own, but if you feel motivated to do so, I highly recommend it, and here are some resources you will find useful. You will want to go and learn about bias neurons, and about dropout, a useful training tool. There are several resources available from Andrej Karpathy, who is an expert in neural networks and great at teaching about them. There is also a fantastic article called "The Black Magic of Deep Learning" which has a lot of practical advice from the trenches on how to make them work well.

Neural networks are famous for being difficult to interpret; it's hard to know what they are really learning when we train them. So let's take a closer look and see if we can get a good picture of what's going on inside.
Just like any other supervised machine learning model, neural networks learn relationships between input variables and output variables. In fact, we can even see how they relate to the most iconic model of all, linear regression. Simple linear regression assumes a straight-line relationship between an input variable x and an output variable y: x is multiplied by a constant m, which also happens to be the slope of the line, and added to another constant b, which happens to be where the line crosses the y axis. We can represent this in a picture. Our input value x is multiplied by m, our constant b is multiplied by one, and then they are added to get y. This is a graphical representation of y = mx + b. On the far left, the circular symbols just indicate that a value is passed through. The rectangles labeled m and b indicate that whatever enters from the left exits multiplied by m or b on the right. And the box with the capital sigma indicates that everything that comes in on the left is added up and spit out on the right.

We can change the names of all the symbols for a different representation. This is still a straight-line relationship; we have just renamed all the variables. The reason we are doing this is to translate our linear regression into the notation we will use for neural networks. This will help us keep track of things as we go. At this point we have converted a straight-line equation into a network. A network is anything that has nodes connected by edges. In this case, x sub 0 and x sub 1 are our input nodes, v sub 0 is an output node, and the weights that connect them are edges. This is not "graph" in the sense of a plot or a grid, as in a graphing calculator or graph paper; graph is just the formal word for a network of nodes connected by edges.
Another piece of terminology you may hear is directed acyclic graph, abbreviated DAG. A directed graph is one where the edges only go in one direction. In our case, the input goes to the output, but the output never feeds back to the input, so our edges are directed. Acyclic means you can never draw a loop: once you have visited a node, there is no way to hop from edge to node to edge to node and get back to where you were. Everything flows in one direction through the graph.

We can get a sense of the type of models this network is capable of learning by choosing random values for the weights w sub 0 and w sub 1 and then seeing what relationship emerges between x sub 1 and v sub 0. Remember that we set x sub 0 equal to 1 and keep it there always. This is a special node called the bias node. It should be no surprise that the relationships emerging from this linear model are all straight lines. After all, we have taken our equation for a line and rearranged it, but we haven't changed it in any substantive way. There is no reason to limit ourselves to a single input variable; we can add another one. Now we have an x sub 0, an x sub 1, and an x sub 2. We draw an edge between x sub 2 and our sum, with the weight w sub 2 0. x sub 2 multiplied by w sub 2 0 is again u sub 2 0, and all our u's are added up to make v sub 0. And we could add more inputs, as many as we want.
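The linear node with a bias input described above can be sketched in the same notation: the bias node is an input pinned to 1, and y = mx + b becomes a weighted sum with weights [b, m]. The values below are made up for illustration.

```python
# A linear node: a weighted sum of inputs, where inputs[0] is the
# bias node, always pinned to 1.
def linear_node(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

# y = m*x + b expressed in network notation: weights = [b, m]
m, b = 2.0, -1.0
x = 3.0
y = linear_node([1.0, x], [b, m])
print(y)  # m*x + b = 2*3 - 1 = 5.0
```

Adding more inputs just means appending more entries to both lists; the node's operation doesn't change.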
This is still a linear equation, but instead of being two-dimensional, it can be three-dimensional or higher. Writing all of this out mathematically could get very tedious, so we will use a shortcut: we use the subscript i as the index of the input, the number of the input we are talking about. This allows us to write u sub i 0, where u sub i 0 equals x sub i multiplied by w sub i 0, and again our output v sub 0 is just the sum over i of all the u sub i 0 values. For this three-dimensional case, we can again look at the patterns that arise when we randomly choose our weights w sub i 0. In this case we would expect to get the three-dimensional equivalent of a line, a plane, and if we extended this to more inputs, we would get the m-dimensional equivalent of a line, which is called an m-dimensional hyperplane.

Now we can start to get more sophisticated. Our input x sub 1 looks a lot like our output v sub 0; in fact, there is nothing stopping us from taking our output and then using it as the input to another network just like this one. Now we have two separate, identical layers. We can add a subscript Roman numeral I or II to our equations, depending on which layer we are referring to, and we just have to remember that our x sub 1 in layer two is the same as our v sub 0 in layer one.
Because these equations are identical and each of our layers works the same way, we can reduce this to one set of equations, adding a capital L subscript to represent which layer we are talking about. As we continue, we will assume all the layers are identical, and to keep the equations cleaner we will omit the capital L, but note that if we were being completely correct and detailed, we would add the subscript L to everything to specify the layer it belongs to.

Now that we have two layers, there is no reason we can't connect them in more than one place. Instead of our first layer producing just one output, we can have it make multiple outputs. In our diagram, we'll add a second output, v sub 1, and connect it to a third input on our second layer, x sub 2. Note that the input x sub 0 of each layer is always equal to 1; that bias node shows up again in every layer. Now there are two nodes shared by both layers. We can modify our equations accordingly to specify which of the shared nodes we are talking about. They behave exactly the same, so we can be efficient and reuse our equation, but with the subscript j to indicate which output we are talking about. So now, if I am connecting input i to output j, then i and j together determine which weight is applied and which u's are added up to create the output v sub j.

We can do this as many times as we want; we can add as many of these shared nodes as we like. The model as a whole only exposes the input x sub 1 at the first layer and the output v sub 0 of the last layer. From the point of view of someone sitting outside the model, the nodes shared between layer one and layer two are hidden; they are inside the black box. Because of this, they are called hidden nodes. We can take this two-layer linear network, create a hundred hidden nodes, set all the weights randomly, and see what model it produces. Even after adding all of this structure, the resulting models are still straight lines. In fact, no matter how many layers you have or how many hidden nodes each layer has, any combination of these linear elements, with their weights and sums, will always produce a straight-line result.
This is actually one of the features of linear computation that makes it so easy to work with, but unfortunately for us it also makes for really boring models. Sometimes a straight line is enough, but that's not why we turn to neural networks. We want something a bit more sophisticated. To get more flexible models, we are going to need to add some nonlinearity. We modify our linear equation: after we calculate our output v sub 0, we pass it through another function f, which is not linear, and we call the result y sub 0.

A very common nonlinear function to add here is the logistic function. It is s-shaped, so it is also sometimes called a sigmoid function, although that can be confusing because technically any s-shaped function is a sigmoid. We can get an idea of what logistic functions look like by choosing random weights for this one-input, one-output, one-layer network and plotting the family of curves. One notable characteristic of logistic functions is that they live between zero and one; for this reason they are also called squashing functions. You can imagine taking a straight line and then squashing the edges, bending it and hammering it, until the whole thing fits between zero and one no matter how far out you go.
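The logistic function described above has a standard closed form, 1 / (1 + e^(-x)), which is a minimal sketch here:

```python
import math

# The logistic function: s-shaped, squashing every input into (0, 1).
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(0.0))    # 0.5, the midpoint
print(logistic(10.0))   # very close to 1
print(logistic(-10.0))  # very close to 0
```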
Working with logistic functions brings us to another connection with machine learning models: logistic regression. This is a little confusing, because regression refers to finding the relationship between an input and an output, usually in the form of a line, curve, or surface of some kind, while logistic regression is actually used as a classifier most of the time. It finds the relationship between a continuous input variable and a categorical output variable. It treats observations in one category as zeros, treats observations in the other category as ones, and then finds the logistic function that best fits all those observations. Then, to interpret the model, we add a threshold, often around 0.5, and wherever the curve crosses the threshold there is a demarcation line. Everything to the left of that line is predicted to fall into one category, and everything to the right of that line is predicted to fall into the other.
That is how you modify a regression algorithm to become a classification algorithm. As with linear functions, there is no reason not to add more inputs. We know that logistic regression can work with many input variables, and we can represent that in our network too. We only add one here, to keep the plot three-dimensional, but we could add as many as we want. To see what kinds of functions this network can create, we can choose a bunch of random values for the weights. As expected, the functions we create are still s-shaped, but now they are three-dimensional. They look like a tablecloth draped over two tables of unequal height.

More importantly, if you look at the contour lines projected onto the floor of the plot, you can see that they are all perfectly straight. The consequence is that any threshold we choose for classification will divide our input space into two halves, and the divider will be a straight line. That is why logistic regression is described as a linear classifier. No matter how many inputs it has, whatever dimensional space you are working in, logistic regression will always divide it into two halves using a line, a plane, or a hyperplane of the appropriate dimension.
Another popular nonlinear function is the hyperbolic tangent. It is closely related to the logistic function, and it can be written in a very symmetric way. When we choose some random weights and look at examples, we can see that hyperbolic tangent curves look just like logistic curves, except that they vary between minus one and plus one. Just as we tried to do before with linear functions, we can use the output of one layer as the input to another layer. We can stack them like this, and we can even add hidden nodes the same way we did before. Here we only show two hidden nodes, to keep the diagram simple, but you can imagine as many as you want.
When we choose random weights for this network and look at the results, we find that things get interesting. We have left the realm of the linear: because the hyperbolic tangent function is nonlinear, when we add these together we get something that doesn't necessarily look like a hyperbolic tangent. We get curves, peaks and valleys, and a much broader range of behavior than we ever saw with single-layer networks.

We can take the next step and add another layer to our network. Now we have one set of hidden nodes between layer one and layer two, and another set of hidden nodes between layer two and layer three. Again we choose random values for all the weights and look at the kinds of curves it can produce. Again we see wiggles and peaks and valleys and a wide selection of shapes. If it's hard to tell these curves apart from the curves generated by a two-layer network, that's because they are mathematically identical. We won't try to prove it here, but there is an interesting result showing that any curve you can create with a many-layer network, you can also create with a two-layer network, as long as you have enough hidden nodes. The advantage of a many-layer network is that it can help you create more complex curves using fewer nodes in total. For example, in our two-layer network we used a hundred hidden nodes; in our three-layer network we used eleven hidden nodes in the first layer and nine in the second. That's only one fifth of the total we used in the two-layer network, but the curves it produces show similar richness.

We can use these fancy squiggly lines to make a classifier, as we did with logistic regression. Here we use the zero line as a boundary: everywhere our curve crosses the zero line, there is a divider. Every region where the curve lies above the zero line we'll call category a, and similarly, everywhere the curve is below the zero line, we have category b.
What distinguishes these nonlinear classifiers from linear classifiers is that they don't just divide the space into two halves in this example, regions of a and b. they are intertwined building a classifier around a multilayer nonlinear network gives it much more flexibility, it can learn more complex relationships this particular combination of multilayer network with hyperbolic tangent nonlinear function has its own name, a multilayer perceptron as you can guess, when you have only one layer, it's just called a perceptron and in that case you don't even need to add the nonlinear function for it to work, the function will still cross the x-axis in the same places here.
Here is the complete network diagram of a multi-layer perceptron. This representation is useful because it makes all the operations explicit; however, it is also visually cluttered and difficult to work with. Because of this, it is often simplified to look like circles connected by lines. This implies all the operations we saw in the previous diagram: the connecting lines each have an associated weight, and the hidden nodes and output nodes perform the summation and nonlinear squashing, but in this diagram all of that is implicit. In fact, the bias nodes, the nodes that always have a value of one in each layer, are omitted for clarity, so our original network boils down to this.
The bias nodes are still present and their operation hasn't changed at all; we just omit them for a clearer picture. Here we show only two hidden nodes in each layer, but in practice we used quite a few more. Again, to keep the diagram as clean as possible, we often do not show all the hidden nodes; we show a few and the rest are implied. Here is a generic diagram for a three-layer, single-input, single-output network. Notice that if we specify the number of inputs, the number of outputs, the number of layers, and the number of hidden nodes in each layer, then we can completely define a neural network.
We can also take a look at a two-input, single-output neural network. Because it has two inputs, when we plot its output it will be a three-dimensional surface. Once again we can choose random weights and generate curves to see what types of functions this neural network could represent. This is where it gets really fun: with multiple inputs, layers, and nonlinear activation functions, neural networks can create really crazy shapes. It is almost correct to say that they can create any shape you want, but it is worth taking a moment to notice what their limitations are. First, notice that all the functions lie between plus and minus one; the dark red and dark green regions kiss the floor and ceiling of this range, but never cross it.
This neural network could not fit a function that extends outside this range. Also notice that all of these functions tend to be smooth. They have hills, dips, valleys, ripples, and even points and wells, but everything happens relatively smoothly. If we expected to fit a function with many jagged jumps and drops, this neural network might not do a very good job. Aside from these two limitations, though, the variety of functions this neural network can produce is a bit mind-boggling. We modified a single-output neural network to be a classifier when we looked at the multi-layer perceptron.
Now there is another way to do this: we can use a two-output neural network instead. Here are the outputs of a three-layer, one-input, two-output neural network. We can see that there are many cases where the two curves cross, and in some cases they cross in multiple places. We can use this to make a classifier: wherever one output is greater than the other, it can mean that one category dominates. Graphically, wherever the two output functions cross, we can draw a vertical line, dividing the input space into regions; in each region, one output is greater than the other.
For example, wherever the blue line is greatest, we can assign category A, and wherever the peach-colored line is greatest, those regions are category B. Just like the multi-layer perceptron, this lets us divide the space in more complex ways than a linear classifier could; the category-A and category-B regions can be interleaved arbitrarily. When you only have two outputs, the advantages of doing it this way over a multi-layer perceptron with a single output are not entirely clear. However, if you go to three or more outputs, the story changes. Now we have three separate outputs and three separate output functions. We can use the same criterion of letting the function with the maximum value determine the category. We start by dividing the input space according to which function has the highest value. Each function represents one of our categories: we assign our first function to category A and label each region where it is on top as category A, and then we can do the same with our second function and the third. Using this trick, we are no longer limited to two categories. We can create as many output nodes as we want and learn to divide the input space into that many categories.
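The pick-the-largest-output rule described above can be sketched like this. The three output curves here are made-up stand-ins for whatever functions a trained network would actually produce:

```python
import math

def outputs(x):
    # Three invented smooth functions, one per category, standing in
    # for the learned output curves of a three-output network.
    return {
        "A": math.tanh(2 * x),
        "B": math.tanh(-2 * x),
        "C": 1.0 - x * x,
    }

def classify(x):
    """The category is whichever output function is largest at x."""
    out = outputs(x)
    return max(out, key=out.get)
```

With these stand-in curves, large positive inputs fall in region A, large negative inputs in region B, and inputs near zero in region C, exactly the kind of region-splitting the text describes.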
It's worth noting that the winning category may not win by much. In many cases, as you can see, the functions can be very close: one category will be declared the winner, but the runner-up may fit almost as well. There is no reason we can't extend this approach to two or more inputs; unfortunately, it becomes harder to visualize. You have to imagine several of these bumpy landscape plots stacked on top of each other; in some regions one will be larger than the others, and in those regions the category associated with that output will be dominant. To get a qualitative sense of what these regions look like, you can look at the contours projected onto the floor of these plots. In the case of the multi-layer perceptron, all of these plots are cut at the level y equals zero, which means that if you look at the floor of the plot, everything in any shade of green will be one category and everything in any shade of red will be the other.
The first thing that jumps out about the boundaries of these categories is how diverse they are. Some are almost straight lines, though with a little wiggle; some have wilder curves and bends; and others cut the space into several disconnected regions of green and red. Sometimes there is a small island of green, or an island of red, in the middle of a sea of the other color. This variety of boundaries is what makes this such a powerful classification tool. The only limitation we can see here is that all the boundaries are smoothly curved. Sometimes those curves are quite sharp, but usually they are smooth and rounded. This shows the natural preference that neural networks with hyperbolic tangent activation functions have for smooth functions and smooth boundaries.
The goal of this exploration was to get an intuitive sense of what types of functions and category boundaries neural networks can learn when used for regression or classification. We have seen both their power and their clear preference for smoothness. We have only looked at two nonlinear activation functions, the logistic function and the hyperbolic tangent, which are very closely related. There are many others, and some of them work a little better at capturing sharp nonlinearities; rectified linear units, or ReLUs, for example, produce surfaces and boundaries that are quite a bit sharper. But I hope this has seeded your intuition with some examples of what's really going on under the hood when you train your neural network.
Here are the most important things to take away. First, neural networks learn functions and can be used for regression. Some activation functions limit the output range, but as long as that matches the expected range of your outputs, it's not a problem. Second, neural networks are most often used for classification, and they have proven to be quite good at it. Third, neural networks tend to create smooth functions when used for regression and smooth category boundaries when used for classification. Fourth, for standard fully connected neural networks, a two-layer network can learn any function that a deep network can learn; however, a deep network may be able to learn it with fewer nodes.
Fifth, making sure the inputs are normalized, that is, giving them a mean close to zero and a standard deviation of less than one, helps neural networks be more sensitive to the relationships in the data. I hope this helps you as you move toward your next project. Happy building! Welcome to how convolutional neural networks work. Convolutional neural networks, or convnets or CNNs, can do some very interesting things. If you give them a bunch of images of faces, for example, they will learn some basic things like edges and dots, bright spots and dark spots; since they are multi-layer neural networks, that's what is learned in the first layer. In the second layer are things that are recognizable as eyes, noses, and mouths, and in the third layer are things that look like faces. In a similar way,
if you feed in a bunch of car images, at the lowest layer you'll get things that look like edges, further up things that look like tires, wheel wells, and hoods, and at the highest level things that are clearly identifiable as cars. CNNs can even learn to play video games: by forming patterns from the pixels as they appear on the screen and learning the best action to take when it sees a certain pattern, a CNN can learn to play video games, in some cases much better than a human. Not only that: if you take a couple of CNNs and set them up to watch YouTube videos, one can learn objects by picking out patterns, and the other can learn types of grasps. These, together with some other supporting software, can allow a robot to learn to cook just by watching YouTube. So there is no doubt that CNNs are powerful. When we talk about them,
we often do it the way we might talk about magic, but they are not magic. What they do is based on some pretty basic ideas applied in a clever way. To illustrate them, we'll talk about a very simple toy convolutional neural network. What this one does is take a two-dimensional array of pixels, an image. You can think of it as a checkerboard, where each square of the checkerboard is either light or dark, and by looking at it, the CNN decides whether it is a picture of an X or of an O. For example, above we see an image with an X drawn in white pixels on a black background, and we would like to identify it as an X; likewise, we would like to identify the O as an O. So how does the CNN do this? It has several steps. What makes it complicated is that the X is not exactly the same every time. The X or the O can be shifted, it can be bigger or smaller, it can be rotated a little, thicker or thinner, and in each case we would still like to identify whether it is an X or an O.
The reason this is a challenge is that for us, deciding whether these two things are similar is simple; we don't even have to think about it. For a computer, it's very hard. What a computer sees is this checkerboard, this two-dimensional array, as a bunch of numbers, ones and minus ones: a one is a bright pixel, a minus one is a black pixel. What it can do is go through pixel by pixel and compare whether or not they match. To a computer, it looks like there are a lot of pixels that match, but some that don't, so it might look at this and say, "I'm not really sure these are the same," and because the computer is so literal, it would answer "uncertain; I can't say they are the same." Now, one of the tricks that convolutional neural networks use is to match parts of the image instead of the whole thing. If you break an image down into its smaller parts, or features, it becomes much clearer whether two things are similar. Examples of these little features are little mini-images, in this case just three pixels by three pixels. The one on the left is a diagonal line sloping downward from left to right; the one on the right is also a diagonal line, sloping the other way; and the one in the middle is a little X. These are little pieces of the bigger picture, and you can see as we go that if you choose the right feature and put it in the right place, it matches the image exactly. Okay, now we have the pieces; let's take it a step further. The math behind matching them is called filtering. The way this works is that a feature is lined up with a small patch of the image, and then, one by one, the pixels are compared: they are multiplied together, the products are summed, and the sum is divided by the total number of pixels.
To see how this works, start at the top-left pixel in both the feature and the image patch. Multiplying one by one gives a one, and we can keep track of that by placing it at the position of the pixel we are comparing. We move on to the next: minus one times minus one is also one. We keep going pixel by pixel, multiplying them together, and since they are always the same, the answer is always one. When we're done, we take all of these, add them up, and divide by nine; the answer is one. Now we want to keep track of where that feature was in the image, so we put a one there: when we put the feature here, we get a match of one. That is filtering. Now we can take that same feature, move it to another position, and perform the filtering again. We start the same way: the first pixel matches, the second pixel matches, the third pixel does not. Minus one times one equals minus one, so we record that in our results, and we do the same with the rest of the image patch. When we're done, we notice that this time we have two minus ones, so adding all the products gives five, and dividing by nine gives 0.55. That is very different from the one we got before, and we can record the 0.55 at the position where it occurred. By moving our filter to different places in the image, we find different values for how well that filter matches, that is, how strongly that feature is represented at each position. This becomes a map of where the feature occurs. Moving it to every possible position is a convolution, which is just the repeated application of this feature, this filter, over and over again. What we get is a nice map across the whole image of where this feature occurs, and if we look at it, it makes sense: this feature is a diagonal line sloping downward from left to right, and that matches the downward left-to-right diagonal of the X. If we look at our filtered image, we see that all the high numbers, the ones and the 0.77s, are along that diagonal, suggesting that the feature matches along that diagonal much better than it does elsewhere in the image.
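The filtering and convolution steps just described can be sketched as follows, using the same +1/-1 pixel convention. The feature below is the downward-sloping diagonal from the text; the image used in the test is invented for illustration:

```python
def filter_match(patch, feature):
    """Multiply corresponding pixels, sum, divide by the pixel count."""
    total = sum(p * f
                for row_p, row_f in zip(patch, feature)
                for p, f in zip(row_p, row_f))
    return total / (len(feature) * len(feature[0]))

def convolve(image, feature):
    """Slide the feature across every valid position to build a match map."""
    fh, fw = len(feature), len(feature[0])
    return [[filter_match([r[j:j + fw] for r in image[i:i + fh]], feature)
             for j in range(len(image[0]) - fw + 1)]
            for i in range(len(image) - fh + 1)]

# Downward-sloping diagonal feature from the text (+1 / -1 pixels).
feature = [[ 1, -1, -1],
           [-1,  1, -1],
           [-1, -1,  1]]
```

A patch identical to the feature scores a perfect 1.0, and a patch with two mismatched pixels scores 5/9, about 0.55, matching the worked example above.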
As shorthand notation, here we will use a little X in a circle to represent convolution, the act of trying every possible match. We repeat this with other features: with our X filter in the middle, and with our upward-sloping diagonal line at the bottom. In each case, the map we get of where that feature occurs is consistent with what we would expect, based on what we know about the X and where our features match. This act of convolving an image with a bunch of filters, a bunch of features, and creating a stack of filtered images is what we will call a convolution layer. It is a layer because it is an operation we can stack with others, as we will show in a minute. In convolution, one image becomes a stack of filtered images; we get as many filtered images as we have filters. So the convolution layer is one trick. The next big trick is called pooling.
Pooling is how we shrink the image stack, and it's pretty simple. We start with a window size, usually two by two or three by three pixels, and a stride, usually two pixels; in practice these just tend to work best. Then we take that window and stride it across each of the filtered images, and from each window we take the maximum value. To illustrate, we start with our first filtered image. Within our two-by-two-pixel window, the maximum value is 1, so we record that. Then we move by our two-pixel stride, two pixels to the right, and repeat: within that window, the maximum value is 0.33, and so on, then 0.55. When we get to the end, we have to be a little creative; we don't have a full window of pixels, so we take the maximum of what is there. We keep doing this across the whole image, and when we're done, we end up with a similar pattern, but smaller. We can still see that our high values are all along the diagonal, but instead of the seven-by-seven pixels of our filtered image, we have a four-by-four-pixel image, roughly half as big as it was.
This makes a lot of sense if you imagine that instead of starting with a nine-by-nine-pixel image, we had started with a nine-thousand-by-nine-thousand-pixel image. Shrinking it makes it more convenient to work with. The other thing pooling does is not care where in the window the maximum value occurs. This makes the result a little less sensitive to position, and the way this plays out is that if you are looking for a particular feature in an image, it can be a little to the left, a little to the right, maybe a little rotated, and it will still be picked up. So we do max pooling on our entire stack of filtered images and get, in each case, a smaller set of filtered images.
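The windowed max operation can be sketched like this. As in the text, partial windows at the edge just take the max of whatever pixels remain; the tiny test image is invented for illustration:

```python
def max_pool(image, size=2, stride=2):
    """Slide a window across the image, keeping the max of each window."""
    out = []
    for i in range(0, len(image), stride):
        row = []
        for j in range(0, len(image[0]), stride):
            # Slicing past the edge naturally yields a partial window.
            window = [v for r in image[i:i + size] for v in r[j:j + size]]
            row.append(max(window))
        out.append(row)
    return out
```

A three-by-three input comes out as two-by-two: each dimension is roughly halved, just as the seven-by-seven filtered image above became four-by-four.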
So that's our second trick. The third trick is normalization. This is just a step to keep the math from blowing up and from collapsing to zero. All you do here is go everywhere in your image where there is a negative value and change it to zero. For example, looking back at our filtered image, the little computational units that do this are called rectified linear units, and all they do is walk through every place where there is a negative value and change it to zero, then the next negative value, change it to zero, and so on. When you're done, you have a very similar-looking image, except there are no negative values, just zeros and positives. We do this with all of our images, and it becomes another type of layer. So in a rectified linear unit layer, a stack of images becomes a stack of images with no negative values.
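The rectified linear unit step is nearly a one-liner:

```python
def relu(image):
    """Replace every negative value with zero; leave the rest unchanged."""
    return [[max(0.0, v) for v in row] for row in image]
```

Every negative pixel becomes zero and every non-negative pixel passes through untouched, which is exactly the behavior described above.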
Now the really fun part: the magic starts to happen when we take these layers, convolution layers, rectified linear unit layers, and pooling layers, and stack them up so that the output of one becomes the input of the next. You'll notice that what goes into each of them and what comes out looks like an array of pixels, or an array of arrays of pixels, and because of that we can stack them very nicely, using the output of one as the input of the next. By stacking them, these operations build on each other. What's more, we can repeat stacks; we can go deep. Imagine making a sandwich that is not just a burger, a slice of cheese, lettuce, and a tomato, but a whole bunch of layers, a double, triple, quadruple decker, as many as you want. Each time, the image becomes more filtered as it passes through convolution layers and smaller as it passes through pooling layers.
The final layer in our toolbox is called the fully connected layer. Here, every value gets a vote on what the answer will be. We take our now very filtered, much smaller stack of images and break it apart, rearranging it into a single list, because that's easier to visualize. Each of those values then connects to each of the answers we are going to vote on. When we feed in an X, there will be certain values here that tend to be high; they predict very strongly that this will be an X, so they get a lot of weight, a strong vote for the X category. Other values tend to be high when the input is an O, so they get a strong vote for the O category. Now, when we get a new input and we don't know what it is and we want to decide, the way it works is that the input goes through all of our convolution, rectified linear unit, and pooling layers and comes out at the end.
Here we get a series of votes, and then, based on the weights each value votes with, we get a nice weighted average vote at the end. In this case, this particular set of inputs votes for an X with a strength of 0.92 and for an O with a strength of 0.51, so X is clearly the winner, and the neural network categorizes this input as an X. So, in a fully connected layer, a list of feature values becomes a list of votes. Again, the nice thing here is that a list of votes looks a lot like a list of feature values, so you can use the output of one layer as the input of the next. That means you can have intermediate categories that are not your final votes, sometimes called hidden units in a neural network, and you can stack as many of these as you want, too. In the end, everything votes for an X or an O, and whichever gets the most votes wins. Putting all of this together, a two-dimensional array of pixels goes in one end, and a set of votes for a category comes out the other. There are a few things we have glossed over here, though; you might be wondering where all the magic numbers come from.
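The voting arithmetic can be sketched like this. The feature values and voting weights below are made up purely for illustration; in a real network, as discussed next, they are learned:

```python
def fully_connected(features, weights):
    """Weighted vote: each feature value contributes to each category."""
    return {cat: sum(f * w for f, w in zip(features, ws))
            for cat, ws in weights.items()}

# Hypothetical flattened feature values and per-category voting weights.
features = [0.9, 0.65, 0.45, 0.87]
weights = {
    "X": [0.6, 0.4, 0.1, 0.5],
    "O": [0.1, 0.2, 0.8, 0.2],
}

votes = fully_connected(features, weights)
winner = max(votes, key=votes.get)
```

With these made-up numbers, X accumulates a larger weighted vote than O, so X is declared the winner, mirroring the 0.92-versus-0.51 example above.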
The things I pulled out of thin air include the features in the convolution layers, those convenient three-pixel-by-three-pixel diagonal lines, and the voting weights. All of these are learned. You don't need to know them and you don't need to guess them; the deep neural network figures them out on its own through a process called backpropagation. The underlying principle behind backpropagation is that the error in the final answer is used to determine how much the network adjusts and changes. In this case, if we knew we put in an X and we got a vote of 0.92 for X, that is an error of 0.08. We got a vote of 0.51 for O, which should have been 0, so that is an error of 0.51. Adding those up gives a total error of 0.59. What happens with this error signal is that it helps drive a process called gradient descent; if there is one thing that is quite special about deep neural networks, it is the ability to do gradient descent. Each of these magic numbers, each feature pixel and each voting weight, is adjusted up and down by a very small amount to see how the error changes. The amount they are adjusted by is determined by how large the error is: large error, adjust a lot; small error, adjust just a little; no error, don't adjust at all, because you have the right answer, so stop messing with it. As they are adjusted, you can think of it like sliding a ball slightly to the left and slightly to the right on a hill. You want to find the direction that is downhill and go down that slope to find the very bottom, because the bottom is where you have the least error. That's your happy place. So after you slide it left and right, you find the downhill direction and leave it there. Doing that many times, over many iterations, many steps, lets all of the values in all of the features and all of the weights settle into what is called a minimum, and at that point the network is working about as well as it can: if you adjust any of them a little, your error will increase. Now, there are some things called hyperparameters, and these are knobs that the designer gets to turn.
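Before moving on to hyperparameters, the nudge-and-step procedure just described, adjusting a value slightly up and down, seeing which way the error drops, and moving downhill, can be sketched with a toy one-weight error surface. The error function, learning rate, and step count here are all invented for illustration:

```python
def numeric_gradient_descent(error_fn, w, lr=0.1, steps=200, eps=1e-4):
    """Nudge w up and down, estimate the slope, and step downhill."""
    for _ in range(steps):
        slope = (error_fn(w + eps) - error_fn(w - eps)) / (2 * eps)
        w -= lr * slope
    return w

# Toy error surface with its minimum (the "happy place") at w = 3.
error = lambda w: (w - 3.0) ** 2

w_best = numeric_gradient_descent(error, w=0.0)
```

Starting from w = 0, the repeated downhill steps settle at the bottom of the bowl near w = 3, where nudging the weight in either direction would only increase the error.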
These are decisions the designer must make that are not learned automatically. In convolution, you determine how many features to use and how big those features should be, that is, how many pixels on a side. In pooling layers, you choose the window size and the window stride, and in fully connected layers, you choose the number of hidden (intermediate) neurons. All of these are decisions the designer has to make. There are some common practices that tend to work better than others, but there is no principled way, no hard-and-fast rules, for the right way to do it, and in fact many of the advances in convolutional neural networks come from finding combinations of these that happen to work very well.
On top of this, there are other decisions the designer must make, such as how many layers of each type to use and in what order. And for those who really like to go off the rails: can we design entirely new types of layers, slide them in, and get fun new behaviors? These are all things people play with to try to get more performance and to tackle more complicated problems with CNNs. What's really interesting is that we have been talking about images, but you can use any two-dimensional data, or even, for that matter, three- or four-dimensional data. The important thing is that, in your data, things that are closer together are more closely related than things that are far apart.
What I mean by this is that if you look at an image, two rows or two columns of pixels that are right next to each other are more closely related than rows or columns that are far apart. Now, you can take something like sound and divide it into small time steps, and for each time slice, the step just before and the step just after are more closely related than time steps far apart; the order matters. You can also split sound into different frequency bands, from bass through midrange to treble, and you can split those even more finely. Again, the frequency bands that are closest together are the most related, and you can't rearrange them; the order matters. Once you do this with sound, it looks like an image, and you can use convolutional neural networks on it. You can do something similar with text, where position in the sentence becomes the column and the rows become words in a dictionary. In this case, it's hard to argue that the order of the words in the dictionary matters.
It's also hard to argue that some words in the dictionary are more closely related to their neighbors than others. So the trick here is to take a window that spans the entire column, from top to bottom, and slide it from left to right. That way it captures all of the words, but only a few positions in the sentence at a time. The flip side of this is a limitation of convolutional neural networks: they are really designed to capture local spatial patterns, in the sense that things right next to each other matter quite a bit, so if the data can't be made to look like an image, they're not as useful. An example of this is, say, customer data, where each row is a separate customer and each column is a separate piece of information about that customer, like their name, their address, what they bought, and the websites they visited. This doesn't look so much like a photo.
I can take those columns and rearrange them, and rearrange those rows, and it still means the same thing; it's still just as easy to interpret. If you took an image and rearranged its columns and rows, you would get a jumble of pixels, and it would be difficult or impossible to tell what the image is about. You would lose a lot of information. So, as a general rule, if your data is just as useful after swapping any of the columns with each other, then you can't use convolutional neural networks. The bottom line is that convolutional neural networks are great at finding patterns and using them
to classify images. If you can make your problem look like finding cats on the Internet, then they're a great asset. Machine learning applications have gained a lot of traction in recent years, and there are a couple of important categories that have had the wind at their backs. One is identifying images, the equivalent of finding cats on the internet, along with any problem that can be made to look like that. The other is sequence-to-sequence translation, which can be speech to text or translation from one language to another. Most of the former is done with convolutional neural networks; most of the latter is done with recurrent neural networks, particularly long short-term memory (LSTM). To give an example of how long short-term memory works, we'll consider the question of what's for dinner. Let's say for a minute that you are a very lucky apartment dweller, and you have a roommate who loves to cook dinner every night. He cooks one of three things, sushi, waffles, or pizza, and you would like to be able to predict what you are going to eat on a given night so you can plan the rest of your day's meals accordingly. To predict what you are going to have for dinner, you set up a neural network. The inputs to this neural network are a bunch of items, such as the day of the week, the month of the year, and whether or not your roommate had a late meeting: variables that could reasonably affect what's for dinner.
If you are new to neural networks, I recommend you pause for a minute and go watch the tutorial on how neural networks work; there is a link below in the comments section. If you'd rather not do that right now and you are not yet familiar with neural networks, you can think of them as a voting process. In the neural network you set up, there is a complicated voting process, and all the inputs, like the day of the week and the month of the year, go into it. Then you train it on your history of what you've had for dinner, and it learns to predict what will be for dinner tonight.
The problem is that your network doesn't perform very well. Despite carefully choosing your inputs and training it thoroughly, it can't make predictions much better than chance. As is often the case with complicated machine learning problems, it's useful to take a step back and just look at the data, and when you do, you notice a pattern: your roommate makes pizza, then sushi, then waffles, then pizza again, in a cycle that doesn't depend on the day of the week or anything else. It's a regular cycle. Knowing this, we can make a new neural network. In our new one, the only input that matters is what we had for dinner yesterday. If we had pizza yesterday, tonight it will be sushi; sushi yesterday, waffles tonight; waffles yesterday, pizza tonight. It becomes a very simple voting process, and it's correct every time, because your roommate is incredibly consistent. Now, if you were away on a given night, say you were out of town yesterday and you don't know what was for dinner, you can still predict what is going to be for dinner tonight by thinking about two days ago. What you had for dinner then tells you what was predicted for last night, and you can use that prediction, in turn, to make a prediction for tonight. So we make use not only of our actual information from yesterday, but also of what our prediction was for yesterday. At this point it's helpful to take a little detour and talk about vectors. A vector is just a fancy word for a list of numbers. If you wanted to describe the weather for a given day, you could say the high is 76 degrees Fahrenheit, the low is 43, the wind is 13 miles per hour, there will be a quarter inch of rain, and the relative humidity is 83 percent. That's a vector. The reason vectors are useful is that a list of numbers is the native language of computers. If you want to get something into a format that is natural for a computer to calculate with, to perform operations on, to do statistical machine learning, lists of numbers are the way to go. Everything is reduced to a list of numbers before going through an algorithm. We can also have a vector for statements like "it is Tuesday." To encode this kind of information, we make a list of all the possible values it could take, in this case all the days of the week, assign a position to each one, and then go through and set them all equal to zero, except for the one that is true right now.
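This zeros-except-one encoding can be sketched in a couple of lines:

```python
def one_hot(item, vocabulary):
    """All zeros except a single one at the item's position."""
    return [1 if v == item else 0 for v in vocabulary]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
dinners = ["sushi", "waffles", "pizza"]
```

For example, `one_hot("Tue", days)` gives `[0, 1, 0, 0, 0, 0, 0]`: every slot is zero except the one that is true right now.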
A long vector of zeros with only one element set to one: this format is called one-hot encoding, and it is very common. It may seem inefficient, but for a computer it is a much easier way to ingest the information. So we can create a one-hot vector for our prediction for dinner tonight. We set everything equal to zero except the dinner item we predict; in this case, we'll be predicting sushi. Now we can group our inputs and outputs into vectors, separate lists of numbers, and this becomes a useful shorthand for describing the neural network. We have the dinner-yesterday vector, the predictions-for-yesterday vector, and the prediction-for-today vector, and the neural network is just the connections between each element of the input vectors and each element of the output vector. To complete our picture, we can show how today's prediction gets recycled.
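This recycling of today's prediction into tomorrow's input can be sketched as a simple loop, assuming the roommate's fixed pizza-sushi-waffles cycle from the example:

```python
# The roommate's cycle: pizza -> sushi -> waffles -> pizza -> ...
NEXT = {"pizza": "sushi", "sushi": "waffles", "waffles": "pizza"}

def predict(last_known_dinner, nights_ahead=1):
    """Roll the prediction forward, reusing each prediction as the next input."""
    dinner = last_known_dinner
    for _ in range(nights_ahead):
        dinner = NEXT[dinner]
    return dinner
```

If you know pizza was served last night, `predict("pizza")` says sushi tonight; if you were away and only know what was served several nights ago, a larger `nights_ahead` replays the cycle forward from there.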
The dotted line means we wait one day and then reuse it tomorrow, when it becomes our prediction from yesterday. Now we can see how, if we were missing some information, say we were out of town for two weeks, we can still make a good guess about what will be for dinner tonight: we just ignore the missing information and unwrap this vector back in time until we reach some information to base it on, then play it forward. When it's unwrapped it looks like this, and we can go back as far as necessary, see what was for dinner, and then follow it forward, replaying our menu for the last two weeks, until we figure out what's for dinner tonight. That was a nice simple example of recurrent neural networks. Now, to show how they don't meet all our needs, let's write a children's book. It will have sentences of the format "Doug saw Jane. Jane saw Spot. Spot saw Doug." and so on. Our dictionary is small, just the words Doug, Jane, Spot, saw, and a period, and the task of the neural network is to put them together in the right order to make a good children's book. To do this we replace our food vectors with our dictionary vectors; here again, it's just a list of numbers representing each of the words. For example, if "Doug" were the most recent word, my new-information vector would be all zeros except for a one at the Doug position, and similarly we can represent our predictions and our predictions from yesterday. Now, after training this neural network and teaching it what to do, we would expect to see certain patterns. For example, every time a name comes up, Jane, Doug, or Spot, we would expect it to vote strongly for the word "saw" or for a period, because those are the two words in our dictionary
that can follow a name. Similarly, if we had predicted a name at the previous time step, we would expect it to also vote for the word "saw" or a period. By the same logic, every time we encounter the word "saw" or a period, we know that a name has to come next, so the network will learn to vote very strongly for a name: Jane, Doug, or Spot. So in this formulation we have a recurrent neural network. To simplify, I'll take the vectors and the weights and collapse them into a little symbol with the dots and the lines that connect them. And there's one more symbol that we haven't talked about yet.
This is a squashing function, and it simply helps the network behave well. You take the total of your votes and put it through this squashing function. For example, if something received a total vote of 0.5, you draw a vertical line up to where it crosses the function, then a horizontal line over to the y-axis, and there's your squashed version. For small numbers the squashed version is pretty close to the original, but as your number increases, the number that comes out gets closer and closer to one, and similarly, if you put in a large negative number, what you get out will be very close to minus one. No matter what you put in, what comes out is between minus one and one. This is really useful when you have a loop like this, where the same values are processed over and over, day after day. You can imagine that if, in the course of that processing, something got voted for twice, multiplied by two, it would become twice as big each time around and would very soon explode to astronomical size. By ensuring that the value is always less than one but greater than minus one,
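The squashing function described here is the hyperbolic tangent, which the standard library provides directly; this is just a sketch of the behavior the talk describes.

```python
import math

def squash(x):
    """Hyperbolic-tangent squashing: output always stays in (-1, 1)."""
    return math.tanh(x)

# Small inputs pass through almost unchanged...
squash(0.1)    # roughly 0.0997
# ...while large inputs saturate near +1 or -1.
squash(100)    # very close to 1
squash(-100)   # very close to -1
```

Because the output can never exceed 1 in magnitude, repeatedly feeding it back through the loop cannot blow up.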
you can multiply it as many times as you want, go around that loop, and it won't explode into a runaway feedback loop. This is an example of negative, or attenuating, feedback. Now, you may have noticed that our neural network in its current state is subject to certain errors. We could get a sentence, for example, of the form "Doug saw Doug.", because "Doug" votes strongly for the word "saw", which in turn votes strongly for a name, and that name could be Doug. Similarly, we could get something like "Doug saw Jane saw Spot saw...". Because each of our predictions only looks back one time step, the network has very short-term memory; it doesn't use information from further back and is subject to these kinds of errors. To overcome this, we take our recurrent neural network and expand it, adding a few more pieces to it.
The critical part we add in the middle here is memory. We want to be able to remember what happened many time steps ago. To explain how this works, I'll have to describe some new symbols we've introduced. One is another squashing function, this one with a flat bottom; one is an × in a circle; and the other is a cross in a circle. The cross in a circle is element-by-element addition. Here's how it works: you start with two vectors of equal size and go down each one, adding the first element of one vector to the first element of the other, with the total going into the first element of the output vector, so 3 plus 6 equals 9; then you move on to the next element, 4 plus 7 equals 11.
Your output vector is therefore the same size as each of your input vectors, just a list of numbers of the same length, but it is the element-by-element sum of the two. The × in a circle is, as you probably guessed, closely related: element-by-element multiplication. 3 times 6 gives you 18 in the first element, 4 times 7 gives you 28, and again the output vector is the same size as each of the input vectors. Now, element-wise multiplication lets you do something quite interesting. Imagine you have a signal, and it's like a bunch of pipes with a certain amount of water trying to flow through them.
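The two element-wise operations just described can be sketched directly, using the same numbers as the example:

```python
# Element-by-element addition and multiplication over equal-length lists.
def elementwise_add(a, b):
    return [x + y for x, y in zip(a, b)]

def elementwise_mul(a, b):
    return [x * y for x, y in zip(a, b)]

elementwise_add([3, 4], [6, 7])  # [9, 11]
elementwise_mul([3, 4], [6, 7])  # [18, 28]
```

In both cases the output vector has the same length as each input vector.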
In this case we'll assign the signal the value 0.8. Now, on each of those pipes we have a faucet, and we can open it completely, close it completely, or leave it somewhere in the middle, letting the signal pass or blocking it. In these terms an open faucet is a one and a closed faucet is a zero, and the way this works with element-wise multiplication, we get 0.8 times 1 equals 0.8: that signal passes into the output vector. But for the last element, 0.8 times 0 equals 0: the original signal was effectively blocked. And with a gate value of 0.5, the signal gets through, but smaller, attenuated. So gating lets us control what passes and what gets blocked, which is really useful. Now, for making gates,
it's good to have a value that you know is always between zero and one, so we introduce another squashing function, represented by a circle with a flat bottom, called the logistic function. It is very similar to the other squashing function, the hyperbolic tangent, except that it goes between zero and one instead of minus one and one. Now, when we put all of these together, what we get is still the combination of our previous predictions and our new information; those vectors are passed in, we make predictions based on them, and those predictions are passed on. But the other thing that happens is that a copy of those predictions is saved for the next pass through the network, and there is a gate right here: some of them are forgotten, some of them are remembered, and those that are remembered are added back into the prediction. So now we have not only predictions, but predictions plus the memories that we have accumulated and not yet chosen to forget. There is a completely separate neural network here that learns when to forget what, based on what we're looking at right now: what we want to remember and what we want to forget. So you can see this is powerful; this will let us hold on to things
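The logistic squashing function and the faucet-style gating it enables can be sketched together; the numbers match the pipe example above, and the function names are illustrative.

```python
import math

def logistic(x):
    """Squash any number into (0, 1), suitable for gate values."""
    return 1.0 / (1.0 + math.exp(-x))

def gate(signal, gates):
    """Element-wise gating: 1 passes, 0 blocks, 0.5 attenuates."""
    return [s * g for s, g in zip(signal, gates)]

# A 0.8 signal through an open, half-open, and closed faucet:
gate([0.8, 0.8, 0.8], [1.0, 0.5, 0.0])  # [0.8, 0.4, 0.0]
```

Because `logistic` can only ever produce values between zero and one, a gate driven by it can attenuate or block a signal but never amplify it.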
for as long as we want. You've probably noticed, however, that when we combine our predictions with our memories, we may not necessarily want to release all of those memories as new predictions every time. We want a little filter: keep the memories inside and let the predictions out. So we add another gate to do the selection. It has its own neural network, its own voting process, so that our new information and our previous predictions can be used to vote on what should be kept internal and what should be released as a prediction. We have also introduced another squashing function here: since we do an addition, things might become greater than one or less than minus one, so we squash them again to make sure nothing gets out of control. Now, when we generate new predictions, we create many possibilities, then accumulate them in memory over time, and from all those possible predictions at each time step we select only a few to release as the prediction for that moment. Each of these decisions, when to forget and when to let things out of memory, is learned by its own neural network. The only other piece we need to add to complete our picture is another set of gates that lets us actually ignore possible predictions as they come in.
This is an attention mechanism. It lets the network set aside things that are not immediately relevant, so they don't cloud the predictions and the memory going forward. It has its own neural network, its own logistic squashing function, and its own gating activity. Now, this long short-term memory (LSTM) network has many pieces working together, and it's a bit much to take in all at once, so what we'll do is walk through a very simple example just to illustrate how a couple of these pieces work. Admittedly it's an overly simplistic example, so feel free to poke holes in it later; when you get to that point, you'll know you're ready to move on to the next level of material. So, we are back to writing our children's book, and for demonstration purposes we'll assume this LSTM has been trained on examples of the children's books we want to imitate, and that all the appropriate votes and weights in those neural networks have been learned. Now we'll show them in action. Our story so far is "Jane saw Doug." So "Doug" is the most recent word in our history, and, not surprisingly, the names Doug, Jane, and Spot were predicted as viable options at this point. This makes sense: we just ended a sentence with a period, and a new sentence can start with any name, so these are all reasonable predictions. So we have our new information, which is the word "Doug"; we have our recent prediction, which is Doug, Jane, and Spot; and we pass these two vectors together to our four neural networks, which are learning to predict, to ignore, to forget, and to select. The first one makes some predictions: given that the word "Doug" just occurred,
it has learned that the word "saw" is a great guess for the next word, but it has also learned that after seeing the word "Doug" it shouldn't see it again very soon. So seeing the word "Doug" at the beginning of a sentence produces a positive prediction for "saw" and a negative prediction for "Doug": it says, I don't expect to see Doug in the near future. That's why Doug is shown in black here. This example is simple enough that we don't need the attention or ignoring mechanism, so we'll skip it for now. This saw-and-not-Doug prediction is passed forward, and again for simplicity, let's say there's no memory right now, so "saw" and "not Doug" are passed forward. Then the selection mechanism, which has learned that when the most recent word was a name, what comes next will be the word "saw" or a period, blocks other names from coming out. So the vote against Doug is blocked here, and the word "saw" is sent out as the prediction for the next time step. Now we take a step forward in time.
Now the word "saw" is our most recent word, and together with our most recent prediction it is fed to all of these neural networks, and we get a new set of predictions. Because the word "saw" just happened, we now predict that the words Doug, Jane, or Spot might come next. We'll skip ignoring and attention in this example and carry those predictions forward. Now, the other thing that happened is that our previous set of possibilities, "saw" and "not Doug", which we were holding internally, is passed to a forget gate. The forget gate says: hey, the last word that occurred was "saw", so based on my past experience,
I can forget that "saw" prediction now that it has occurred, but I want to keep any predictions involving names. So the "saw" prediction is dropped and the name predictions are carried forward in memory. When those remembered predictions are combined with the new ones, the positive vote for Doug and the negative vote for Doug cancel each other out, so now we effectively only have votes for Jane and Spot, and those pass forward to our selection gate. It knows the word "saw" just happened and that, from experience, a name will come next, so it lets those name predictions through, and for the next time step we get predictions for just Jane and Spot, not Doug. This avoids the "Doug saw Doug." type of error and the other errors we saw. What this shows is that long short-term memory can look back two, three, many time steps and use that information to make good predictions about what will happen next. Now, to be fair to basic recurrent neural networks, they can actually look back a few time steps too, but not many. LSTMs can look back many time steps, and this has proven really useful in some surprisingly practical applications. If I have text in one language and I want to translate it into text in another language,
LSTMs work very well. Although translation is not a word-by-word process, it's a phrase-by-phrase or even, in some cases, sentence-by-sentence process, LSTMs are able to represent the grammatical structures that are specific to each language, and what it looks like is that they find the higher-level idea and translate it from one mode of expression to another, simply using the pieces we just walked through. Another thing they do well is translating speech to text. Speech is just a set of signals that vary in time; an LSTM takes them and uses them to predict what text, what word, is being spoken, and it can use the recent history of words to make a better guess about what will come next.
LSTMs are ideal for any information embedded in time: audio, video. My favorite application of all, of course, is robotics. Robotics is nothing more than an agent taking in information from a set of sensors and then, based on that information, making a decision and taking an action. It is inherently sequential, and the actions taken now can influence what is sensed and what should be done many time steps into the future. If you are curious
what LSTMs look like in math, this is taken directly from the Wikipedia page. I won't go through it, but it's encouraging that something that seems so complex when expressed mathematically actually makes for a pretty simple picture and story. If you want to dig deeper, I recommend the Wikipedia page. There are also a collection of really good tutorials and discussions of other ways to explain LSTMs that you may find useful. I also highly recommend you check out Andrej Karpathy's blog post, which shows examples of what LSTMs can do with text. You could be forgiven if, when reading about this on the internet, you substituted "magic" for "deep learning"; it fits perfectly into all the articles, and it's hard to know what it can't do. So the point of this talk is to cover the practicalities at a really simple level, and in case you want to take a nap, here's the summary: deep learning is not magic, but it is really good at finding patterns. So if this is our brain,
this is deep learning. An owl can fly and a fighter plane can fly, but there are many things an owl can do, and arguably it is far more complex, even though what the fighter plane does, it does extremely well. Deep learning is the highly specialized, highly engineered fighter plane. Today we're going to talk about the basics, the Wright brothers' plane. If you understand the principles by which it works, it's easy to dig into the finer engineering details later, but there are many refinements that turn it into a fighter jet which we are not going to cover in detail. Still, we can talk about this at a comfortable level.
This is a neuron. Like all neurons, it has a large body in the middle, a long tail, and some arms that branch out. Here is an artist's conception of a neural network, a group of neurons: again, large bodies, long tails, arms. This is a real image of neurons in some brain tissue; here the bodies look like dots or blobs, you can see long tails, some of which branch, and the arms are practically invisible. And again, an image of brain tissue: here the neurons are small dots and you can barely see the tails. This is just to give you a sense of how tightly packed these things are and how many of them there are, big numbers with many zeros, and the crazy thing is that many of them are connected to many of their neighbors. This is one of our first images of a neuron: Santiago Ramón y Cajal found a stain he could introduce into a cell to darken everything, and under his 19th-century optical microscope he was able to see this and then draw it with pencil and paper. This is old school. What you see here, though: bodies, long tails, many arms. We're going to turn it sideways, because this is how neurons are normally represented in networks, and these pieces actually have names.
The bodies are called somas, the long tails are called axons, and the arms are called dendrites. Let's draw a cartoon version of them; this is what they look like in PowerPoint. Now, the way neurons work: a dendrite, which you can think of as an antenna or whisker, listens for electrical activity, picks it up, and sends it to the body, the soma. The soma takes this, adds it up, and accumulates it, and then, depending on how fast activity is accumulating, it activates the axon and sends a signal down the tail. The more dendritic activity there is, the more axonal activity there will be, and if all the dendrites are fully active, the axon is as active as it can be. In a very simplistic sense, a neuron adds things up. Now, a synapse is where the axon of one neuron touches the dendrite of another. In that artist's conception, and in Ramón y Cajal's drawings, you can see these little bumps, actually called boutons, on the dendrites; they are places where the axon of another neuron makes contact, so you can imagine there is a little connection there.
We will represent that connection by a circle, and the diameter of that circle is the strength of the connection: big circle, strong connection. It can connect strongly, or weakly, or somewhere in between, and we can put a number on this connection between zero and one; a medium connection we'll call 0.6. When the axon of the input neuron, the upstream neuron, is active, it activates the dendrite of the output neuron, transmitting with modest strength. If that connection is strong, it transmits the signal very strongly; if that connection is one, then when the axon is active, the downstream dendrite is fully active. In the same way, if that connection is very weak, say 0.2, then when the axon is active, the dendrite of the output neuron is only weakly activated,
and no connection at all is zero. Now this starts to get interesting, because many different input neurons can connect to the dendrites of a single output neuron, and each connection has its own strength. We can redraw this by removing all the parallel dendrites and simply drawing each axon and the single dendrite it connects to, with the strength of each connection represented by a dot. We can substitute numbers, or line thicknesses, for the weights to show how strongly these things are connected. Most of the time, neural networks are drawn like this, and this is the journey we've made: from the super-complex slice of brain tissue, with many subtleties in its functioning and interconnection, to a nice circle-and-line diagram where each line carries a weight. Even in this form you can do some interesting things: input neurons can connect to many output neurons, so what you really get is many inputs, many outputs, and the connection between each pair is different and has its own weight.
This is good for making pretty pictures, and it's also great for representing combinations of things. Here's how: say you have five inputs labeled a, b, c, d, e. In this case the output neuron has strong connections to a, c, and e, and very weak connections to b and d. That means that when input neurons a, c, and e are active, all of that together strongly activates the output neuron; b and d don't matter because they are only weakly connected. So one way to think about this output neuron is in terms of the inputs that strongly activate it, which is why we call it the "ace" neuron. Here we have an atomic example of what is happening: this output neuron represents a combination of the input neurons. This is neural networks in a nutshell, and you can do this with any kind of input. Take a four-pixel camera, very low tech: each of the four inputs is one of the top-left, bottom-left, top-right, or bottom-right pixels. With strong connections to the top-left and top-right pixels, we have an output neuron that represents a bar across the top half of the image. So we can combine letters, and we can combine pixels to create small images. If you are processing text, the input neurons can represent individual words; in this case, pulling words from the text, this output neuron is strongly connected to the input neurons for "eye" and "ball",
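The "ace" neuron above can be sketched as a weighted sum; the specific weight values here are made-up placeholders for illustration.

```python
# The basic combining operation: a weighted sum of the inputs.
def neuron_activity(inputs, weights):
    return sum(i * w for i, w in zip(inputs, weights))

# Hypothetical weights for the "ace" neuron: strong on a, c, e; weak on b, d.
ace_weights = [1.0, 0.1, 1.0, 0.1, 1.0]  # inputs a, b, c, d, e

neuron_activity([1, 0, 1, 0, 1], ace_weights)  # a, c, e active: high activity
neuron_activity([0, 1, 0, 1, 0], ace_weights)  # b, d active: low activity
```

Activating a, c, and e drives the output high, while b and d barely register, which is exactly why the neuron "represents" the a-c-e combination.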
so we can call it an "eyeball" neuron. Similarly, we can have a "sunglasses" neuron, and since input neurons can connect to many outputs, we could just as easily have a "glasses" neuron too. Now, digging a little deeper, here is a somewhat trivial example to show how these things work in practice. There's a guy at the shawarma place who makes shawarma like no one else, so make sure to go when he's working. Stepping back, we actually have some domain knowledge here: we know he has two work schedules, working in the morning with the afternoon off, or the morning off and working in the afternoon. If we were to equip this with sensors, we would have "working in the morning", "off in the morning", "working in the afternoon", and "off in the afternoon", and it might be useful to represent his work patterns with a pair of output neurons that combine them. So this is the network we would expect to end up with: working in the morning and off in the afternoon is one pattern; off in the morning and working in the afternoon is the other, and you can see from the connection strengths how they combine those inputs. These would be the weights associated with them. Now the question is, how do we learn this? If we have to go in and fill everything in by hand, we haven't learned anything; it's just a fancier way of programming, and it requires a lot of hard work, especially if you are dealing with many millions of input neurons. We want to learn this automatically, and the way to get started may be a bit counterintuitive: we create our neural network,
we have our input neurons, and all we choose is the number of output neurons. In this case we'll choose two, because we know we are learning two patterns. Then we assign the weights randomly: we generate a random number for each of them. The network starts completely by chance; you roll the dice, and whatever lands, that's what you start with. Then we start collecting data. We stand across the street, and we notice that on this particular day the shawarma guy worked in the morning and then went home; he didn't work in the afternoon. That means the "working in the morning" input is active; we'll say it's at level one. "Off in the morning" is at level zero, because we didn't observe that; "working in the afternoon" is at zero; and "off in the afternoon" is at one, because we observed that too. The next step is to calculate the activity of each output neuron. An appropriately simple way to do this is to take the average of the weights on the active inputs: here this weight is 0.3 and this weight is 0.1, so their average is 0.2.
The inactive inputs contribute nothing. Similarly, we can take the weights between the active inputs and the other output neuron, 0.8 and 0.4; the average of those is 0.6. The top-right neuron has the higher activity, so that's the one we care about; we ignore all the others (there could be a million others) and focus on this one for this step. The first thing we figure out is how wrong it was. If our neural network were perfect, this output would have an activity of one; it would be perfectly aligned with our inputs. But it only has an activity of 0.6, so the error is 0.4. The size of the error is a signal of how much we need
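The activity calculation just described can be sketched directly. Only the weights on the two active inputs (0.3 and 0.1 for one output neuron, 0.8 and 0.4 for the other) come from the example; the remaining weights are made-up placeholders.

```python
# Output activity as the average of the weights on the active inputs.
def output_activity(inputs, weights):
    active = [w for i, w in zip(inputs, weights) if i == 1]
    return sum(active) / len(active)

# Observed day, input order:
# working-morning, off-morning, working-afternoon, off-afternoon.
inputs = [1, 0, 0, 1]

w_first = [0.3, 0.7, 0.2, 0.1]   # weights into the first output neuron
w_second = [0.8, 0.5, 0.9, 0.4]  # weights into the second output neuron

output_activity(inputs, w_first)   # (0.3 + 0.1) / 2 = 0.2
output_activity(inputs, w_second)  # (0.8 + 0.4) / 2 = 0.6
```

The second neuron wins with activity 0.6, so its error against a perfect score of 1 is 0.4, exactly as in the walk-through.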
to adjust our weights. When that error is very small, it means the weights already represent what is happening and we don't need to make further changes. The trick here is gradient descent; if there is a magical ingredient in deep learning, it is gradient descent. What you do is go through and adjust each of these weights, nudging them a little up and a little down, and see in which direction the error decreases. The idea behind gradient descent is that a weight is a quantity you can change a little bit from side to side, and as you do, the error will change.
You can think of it like a ball on a hill: if you move it a little to the left it has to go uphill, and if you move it a little to the right it goes downhill, so you choose the downhill direction. You want to reduce that error as much as possible, and you take small incremental steps to keep everything numerically stable. So we go ahead and do this
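The nudge-and-check procedure can be sketched as a miniature numerical gradient descent over the averaged-weight model above. This is a toy illustration of the idea, not how backpropagation is actually implemented; all names and the step size are assumptions.

```python
def error(weights, inputs, target):
    """Distance between the target activity and the averaged output."""
    active = [w for i, w in zip(inputs, weights) if i == 1]
    return abs(target - sum(active) / len(active))

def gradient_step(weights, inputs, target, step=0.01):
    """Nudge each weight a little up and a little down, keeping whichever
    direction shrinks the error: numerical gradient descent in miniature."""
    new = list(weights)
    for k in range(len(new)):
        base = error(new, inputs, target)
        for delta in (step, -step):
            trial = list(new)
            trial[k] += delta
            if error(trial, inputs, target) < base:
                new[k] = trial[k]
                break
    return new
```

Repeating `gradient_step` over many observed days is what lets the weights settle toward values that represent the patterns in the data.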
for all the weights that tie the input neurons to our output, and we discover that, yes, we want to increase this one. For the inactive inputs we actually have a bias toward low weights, so it doesn't hurt to decrease those; so we go ahead and decrease that weight, decrease that weight, and increase this one. When we do that, sure enough, our new activity is 0.7, so our error went from 0.4 to 0.3; the network is a little better at representing what we saw. That was one data point. We go back and do the same thing the next day; it happens that this day he is off in the morning and working in the afternoon. We adjust the weights, and we do this day after day.
So what we just saw was a single layer, we have inputs, we have outputs, each output is a combination of things that were in the previous layer. Now there's no reason we can't turn around and take that output layer and turn it into inputs. for the next layer and do that over and over again if a network has more than three layers or that's what we call deep, some have more than 12. In some recent research at Microsoft, their deep neural networks would cost over a thousand dollars, there's no theoretical reason to limit the number of layers you have, it just depends on the details of your problem now, how deep does it get you?
Why go deep? Say your input neurons are letters of the alphabet. Here is a deep neural network with all the connections omitted for clarity: these are your inputs, and this is your first layer of outputs, which are combinations of those letters. Each level you go up, you get combinations of what happened at the previous level, so by the time you get to your second level of outputs, you might get words in the English language, if that's what you're training on. At the top layer you get combinations of words: short phrases, and so on, and you can make this as deep as you want. So there is a variety
of things you can learn with deep neural networks. A very popular one is images. If you take pixels as inputs, so instead of watching the shawarma guy's schedule you are looking at individual images as your training data, then what you start learning after a while, in the first layer, are little representations of short lines, dots, and patches: the primitives of an image. If you train on images of faces, your second-layer outputs start to look like eyes, noses, mouths, and chins, and your third-layer outputs start to look clearly recognizable as faces. Similarly, if you train on cars, your second-layer outputs start to look like wheels, doors, and windows, and your third layer looks like cars. The cool thing is that we didn't have to go in and tune any of those weights; the network learned this just from looking at a lot of images. You can also do this with color images.
Here are some of the output neurons of an eight-layer neural network, and as you go deeper, you can see things that are clearly recognizable and quite complex: spiders, rocking chairs, candles, potato chips, and teddy bears. You can also feed in information about musical artists. Here is some research where output neurons were learned from information about artists, and the artists were then plotted based on how similar their representations were in those output neurons. We see that artists like Kelly Clarkson and Beyoncé end up near each other, not too far from Taylor Swift and Avril Lavigne, whereas up here we get The Black Keys, Weezer, Modest Mouse, and The Presidents of the United States of America all in the same neighborhood. This is a network that knew nothing, and still knows nothing, about music, but because of the data it gets from the input neurons, it is able to group these things together appropriately: it finds patterns, and then finds the things that best fit those patterns.
It turns out you can take Atari 2600 games, feed the pixel representations in as the input neurons, learn some useful features, and then combine that with something else called reinforcement learning, which learns appropriate actions. When you do this for a certain class of games, the system can learn to play them much better than any human player ever has or is likely to. It also turns out that you can have a robot watch YouTube videos on how to cook: it uses a pair of deep neural networks, one to interpret the video and another to learn to understand its own movements, and then uses that pair, together with some other software, to cook based on the video it sees. So while it's not magic, it's cool what you can do. As you read the literature and popular articles about this, you can play buzzword bingo: there are some popular algorithms out there, and you can think of a lot of them like the model numbers of various fighter jets. When you see any of these terms, you can mentally substitute "deep learning" and apply what you know about the Wright brothers' plane, and most of it will still be accurate. The bottom line: deep learning doesn't do everything, but it's pretty good at learning patterns.
I'm excited to be able to talk about two of my favorite topics at the same time: artificial intelligence and robots. They go pretty well together, but they are not the same; you can definitely have one without the other. First, a few caveats. I'm not going to give you the answer to human-level intelligence; I would if I had it, but I don't. These are my personal opinions, definitely not those of any current or former employer, and they don't reflect those of many experts in the field. Take them with a grain of salt; if they're useful, you're welcome to them, and if not, discard them. The story I'm going to tell you is not rigorous, it has no equations, it's conceptual, and I'm just trying to start a discussion and encourage ideas. Throughout this presentation we will talk about intelligence, and the working definition I propose is that it is a combination of how much one can do and how well one can do it. Theoretically, you could be extremely good at one thing and not be very intelligent; you could also do many things but do them all very poorly, which isn't intelligent either. Intelligence is the combination of being able to do many things and doing them well.
This is a functional definition of intelligence. There are many other potential definitions; some of them can be measured experimentally and others cannot. This particular definition has the advantage that if we wanted to reduce it to a measurable set of tasks, we could, because it is, at least in theory, observable. This allows us to have a scientific discussion about machine-level intelligence. It allows us to formulate hypotheses that we could potentially refute. And it allows us to compare the relative intelligence of two separate agents. In short, it is practical and useful. No doubt some will find it philosophically unsatisfying; that is a separate conversation, but I would love to have it in another forum. Using fake mathematics, we can say that intelligence equals performance multiplied by generality. It is only fake because we have not yet defined those terms.
Assuming we do define performance and generality, you can imagine graphing them and putting them on a set of axes like this. Even though we haven't defined it quantitatively, what I would like to propose is that human-level performance is the level at which a human expert can do something. This can be measured in many different ways: it can be error rates, the time it takes to execute a task, the time it takes to learn a task, the number of demonstrations needed before learning a task, the amount of energy expended when performing a task, the subjective judgment of performance by a panel of human judges, the speed with which someone does something. There are many aspects of performance, and I'm not going to try to
specify or quantify them all here; I only list them to illustrate that we are considering performance in a broad sense rather than in the narrow sense of machine-learning classification accuracy. If we consider human-level performance to be something of a baseline, we can place it on our x axis and then divide the rest of the axis into equal increments. We will make this a logarithmic scale to allow us to compare a very wide range of performances; equal steps along this axis represent equal multiplicative factors of change relative to the human level. Generality is the set of all tasks that humans can do and have undertaken; these include things like writing stories, baking cakes, building cities, collecting and transmitting information around the world, and even exploring the origin of the universe. It is a very broad set of activities. We can represent generality at the human level on the y axis: roughly, this is the set of all the tasks that a human or group of humans can perform. We will also make the y axis logarithmic, so an equal interval is a factor of 10 in generality, either multiplied or divided depending on whether you are moving up or down. Human intelligence can then be represented, theoretically, by the area implied by this point.
I just want to point out that there is no reason to believe that machines cannot surpass human performance in some areas. Humans have a number of limitations that are built into the way we have achieved our intelligence through evolution, things that may have been very useful at one time, or may be useful in general terms, but now may not be useful in pushing the limits of intelligence: things like limited attention, instinctive impulses, the way every part of us gets fatigued, and a number of cognitive biases, all of which puts some distance between us and perfectly rational or perfectly efficient or optimal behavior. Machines, by comparison, have a richer environment in which to grow: they don't have to evolve, and we are trying everything we can to encourage them. Now, you can imagine on the performance-generality axes another agent that can perform a much larger set of tasks than humans, although it does them all worse than a human. It could look like this: the area under that rectangle, the overall intelligence, would still be comparable to that of a human, so we would call that level of intelligence human-level.
You can also imagine an agent that can perform only a subset of the tasks that humans perform but does them much better; the area under that rectangle is also approximately the same as for humans, again a human level of intelligence. Now, if we take the set of all agents that have approximately the same area under their intelligence rectangle, we get a curve that represents human-level intelligence. Any agent that falls along that curve would be comparable to a human, any agent down and to the left has subhuman intelligence, and any agent up and to the right has superhuman intelligence.
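The "fake math" above can be made concrete with a toy sketch. All numbers here are illustrative (performance and generality relative to human = 1.0); nothing in it comes from a real benchmark:

```python
def intelligence(performance, generality):
    """Toy version of the talk's 'fake math': intelligence as the
    product of performance and generality, both relative to human = 1.0."""
    return performance * generality

# A narrow expert: 10x fewer tasks, each done 10x better than a human.
narrow_expert = intelligence(performance=10.0, generality=0.1)
# A broad novice: 10x more tasks, each done 10x worse.
broad_novice = intelligence(performance=0.1, generality=10.0)
human = intelligence(1.0, 1.0)

# All three land on the same human-level curve (equal-area rectangles).
print(narrow_expert, broad_novice, human)
```

On log-log axes, all points with the same product lie on one straight line, which is why the human-level curve in the plot looks the way it does.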
Now let's look at some agents you may be familiar with and see where they fall in this scheme. Chess-playing AI: world-chess-champion-level computers have been around for decades now. IBM's Deep Blue beat Garry Kasparov in 1997. This was a task that people assumed computers would find very difficult: it involves planning, strategy, and thinking about mental models of your opponent. It seemed to encompass the very peak of human cognition, and it seemed unlikely that such a thing could be done by a sophisticated calculator, but it did it, and now a chess program running on your phone can do almost the same thing that Deep Blue did.
The current state of the art is a program called Stockfish, which has an Elo rating, which is like a chess skill score, of 3447. Compare this to the greatest human player of all time, Magnus Carlsen, who achieved a 2882: the program and the human are not even comparable, they are not even close. It's worth noting that Stockfish is an open-source project with code freely available and a number of contributors spread all over the world. Now, in terms of generality, Stockfish understands the rules of chess, and in fact understands them very well. It has a lot of human-contributed hints, tips, and tricks that are specific to chess. It uses a point system to evaluate the pieces depending on where they are and what stage the game is in. A complete endgame table is used once there are only a few pieces left on the board: the number of possibilities for how the game can play out is so small that they can be enumerated completely, so there is no search needed; you basically just look up the answer in a giant table and figure out what to do next. There are manually tuned strategies for each phase of the game, and then what the computer does is use tree search to evaluate future moves. Each move choice is a branch of this tree, and you can look and say, for each move, what is the likely outcome; for each of those outcomes, what are the possible moves my opponent can make; for each of those, what are my possible answers; and by going all the way down this branching tree and looking at all the possibilities, the program can figure out what its best options are. Now, one of the things that makes Stockfish so good at its game is that it is very good at pruning this tree and ignoring moves
that are unlikely to lead to a good ending, and not exploring them very far. But all in all, this program is pretty useless at anything other than chess; it's the opposite of general. So in our plot we would place chess well above the level of human performance but also very low on the generality axis. Now compare it to Go, also a board game. If you are not familiar with it, it has a 19 by 19 grid, and the opponents take turns placing white and black stones.
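The tree search and pruning that Stockfish uses, described a moment ago, can be sketched in miniature. This is a generic alpha-beta minimax on a made-up toy game, not Stockfish's actual code; the function names and the toy game are invented for illustration:

```python
def minimax(state, depth, alpha, beta, maximizing, moves, evaluate):
    """Minimal alpha-beta tree search sketch.
    `moves(state)` yields (move, next_state); `evaluate(state)` scores a leaf."""
    options = list(moves(state))
    if depth == 0 or not options:
        return evaluate(state)
    if maximizing:
        best = float("-inf")
        for _, nxt in options:
            best = max(best, minimax(nxt, depth - 1, alpha, beta, False, moves, evaluate))
            alpha = max(alpha, best)
            if beta <= alpha:   # prune: the opponent will never allow this line
                break
        return best
    else:
        best = float("inf")
        for _, nxt in options:
            best = min(best, minimax(nxt, depth - 1, alpha, beta, True, moves, evaluate))
            beta = min(beta, best)
            if beta <= alpha:
                break
        return best

# Tiny toy game: the state is a number, moves add or subtract 1,
# and a leaf's score is just the number itself.
toy_moves = lambda s: [("+1", s + 1), ("-1", s - 1)] if abs(s) < 3 else []
score = minimax(0, 2, float("-inf"), float("inf"), True, toy_moves, lambda s: s)
```

The pruning step is what Stockfish does on a massive scale: whole branches are abandoned as soon as it is clear the opponent would never let the game go there.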
The rules are actually, in some ways, simpler than chess: on each turn you choose an intersection at which to place a stone. The strategy, however, some argue, is even more complex and, more importantly, more subtle. What is certainly true is that there are more possible board configurations: where chess has an eight by eight board, Go has a 19 by 19 board, and where each chess piece has a small prescribed set of moves, a stone can be placed at any open intersection, so when doing a tree search the number of moves explodes much faster than in chess. Despite this, two
years ago, AlphaGo, a program created by DeepMind researchers, defeated Lee Sedol, a nine-dan professional player; a nine-dan professional ranking places him among the stars of the Go world. A later version, AlphaGo Master, had an Elo rating of 4858. Compare this to the highest-rated human player, Park Junghwan, who has a rating of 3669: it has not only beaten the top players of the world, it now does so by a very healthy margin. As we mentioned, the program knows the rules of Go and uses tree search to evaluate moves. Because there are so many possible board configurations, however, it can't memorize them all, so it uses convolutional neural networks to learn common configurations and, most importantly, to see patterns that do not repeat exactly but are
similar to ones it has seen before. This is an important innovation, and we'll get back to convolutional neural networks in a few minutes. It also uses reinforcement learning on a library of human games to learn which moves are good. Reinforcement learning, in very simple terms, is looking at a configuration, an action, and the result, and learning the pattern: for a given configuration, if I take action a, good things happen; if I take action b, bad things tend to happen. After learning those patterns, the next time I see configuration a, I can take the action that leads to good things. So, using reinforcement learning on a library of human games, they start AlphaGo off and let it learn from the history of human play and human knowledge. But like Stockfish, it's useless for anything that is not Go. Even though it has some tricks that allow it to be more general, it is still very limited in its applications, so in our plot we could put it here, again far exceeding human-level performance but still very low on the generality axis.
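The "configuration, action, outcome" pattern just described can be sketched with a tiny tabular value store. This is a hugely simplified illustration of the idea, not how AlphaGo is actually trained:

```python
from collections import defaultdict

q = defaultdict(float)   # (configuration, action) -> estimated value
counts = defaultdict(int)

def learn(configuration, action, outcome):
    """After seeing an outcome (+1 good, -1 bad), update a running
    average of how well this action works in this configuration."""
    key = (configuration, action)
    counts[key] += 1
    q[key] += (outcome - q[key]) / counts[key]

def best_action(configuration, actions):
    """Next time we see this configuration, pick the action that
    has led to good things."""
    return max(actions, key=lambda a: q[(configuration, a)])

# "If I take action a, good things happen; if I take action b, bad things happen."
for _ in range(10):
    learn("config_A", "a", +1.0)
    learn("config_A", "b", -1.0)

print(best_action("config_A", ["a", "b"]))
```

The real system replaces this lookup table with deep neural networks, precisely because the number of Go configurations is far too large to tabulate.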
Now let's jump to a completely different category: image classification. There is a wonderful dataset called ImageNet which has thousands of images classified by hand by humans into a thousand predefined categories. These categories include household items, cars, doors, and chairs; they include animals, giraffes, rhinos; they also include, for example, many different species of dogs, so it is not a trivial thing to take an image and categorize it accurately. In fact, a typical human score on this is around five percent error: about one in 20 images a human will classify incorrectly, so it's a difficult task. In 2011, a large-scale visual recognition challenge was started where teams pitted their algorithms against each other to classify these images, and in 2011 the best one got a 26 percent error rate, so about one in four images was classified incorrectly. Still, three out of every four were correct, which was a pretty impressive performance. Each year, this error rate dropped by about half, which is an amazing rate of progress.
Finally, in 2015, the error rate dropped lower than a human's, so we had a computer program that classified the images better than a human. By 2017, in one of the most recent competitions, more than half of the teams had less than five percent error, so machines now routinely outperform humans at this task, which is pretty impressive. In terms of generality, this task is definitely more difficult, more general, more challenging than a board game: there are more variations, more possibilities. It uses convolutional neural networks, which are a deep neural network architecture specifically designed to find patterns in two-dimensional arrays of data, such as pixels or the squares on a chess board or Go board.
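The core operation of a convolutional layer, sliding a small two-dimensional filter over a two-dimensional grid of data and recording how well it matches at each position, can be sketched with plain loops. The image and filter here are made up for illustration; real networks learn their filters from data:

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and record the match strength
    at every position (valid positions only, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = s
    return out

# A vertical-edge filter responds strongly where dark meets light.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1],
                 [-1, 1]]
response = convolve2d(image, vertical_edge)
```

The response peaks exactly at the column where the dark half of the image meets the light half, which is the sense in which the filter "finds" a vertical edge.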
They are very good at finding patterns that can be represented visually. This is good, but they have been shown to break easily outside the set of images you train them on. To give an example of this, in the left column of the figure we see a soap dispenser, a praying mantis, a puppy; these are all images that were correctly categorized by a convolutional neural network. With the addition of a bit of distortion, shown in the middle column, which is just a little noise, we get the images on the right: to us they look identical or very similar, maybe with a little warping and distortion visible, but for some reason the convolutional neural network confidently
predicted that these would all be ostriches. So this is not to say that these networks are not powerful and good, but they see something different from what we are seeing; they are not learning to see the way we see. The fragile nature of convolutional neural networks has been demonstrated in other ways: on some images, changing a single pixel, the right pixel to the right value, can change the way that image is classified. Others have discovered that it is not even necessary to stay in the digital domain: you can take carefully designed stickers and attach them to something and have that object be confidently classified as a banana, whatever it actually is. And in my favorite demo, a physical plastic turtle was rotated, and from all directions the convolutional neural network confidently predicted that it was a turtle; then, after being repainted with a different pattern, not symbolic or representative of anything but carefully chosen, that same convolutional neural network categorized it as a gun. These examples show that, at least as it is currently done, the generality of image classification is not exactly where we would like it to be: definitely higher than human performance, but classifying ImageNet is a much narrower task than it might seem on the surface, so we will place it quite low on the generality axis.
Now here's a really fun example: video game performance. Again, the folks at DeepMind put together a deep Q-learning, or deep reinforcement learning, architecture for playing video games. We'll talk more about what that is in a second, but what they did was take 49 classic Atari games and let the algorithm just look at the pixels and make random moves. The algorithm didn't know whether a given action was a move to the left, a move to the right, a jump, or a shot; it just took actions and then used reinforcement learning to learn from the result, learning the pattern of, oh, when I see this and do this, something good happens, or something bad happens, or something neutral happens. After doing that for long enough, it learned the patterns that allowed it to choose the right thing, and in 29 of these 49 games it played at human expert level or higher, and that was really impressive. So it's not just looking at a picture and saying this is a cat; it's looking at a picture in the moment and saying that, for this particular situation, the right thing to do is to jump, and then, when it jumps, that changes the image, and it has to respond to the new situation, over and over again, doing it better than a human. Now, the other part of this is that there were 20 games where it did worse than a human. So after using convolutional neural networks to learn the pixel patterns, for which they are perfectly suited because the pixels are big and chunky, there is no noise, they don't change, and the patterns are clear, so what the algorithm sees is very close to what we as humans see, after that it used reinforcement learning to learn
what actions to take in each situation. In 20 of these games it could not match human performance, and the pattern among those games is that they tended to require longer-term planning. One of them was Ms. Pac-Man, and if you have ever played it, you know that you are trying to eat all the dots in a maze while avoiding the ghosts that are chasing you.
It involves planning routes several turns in advance, anticipating where the ghosts will be, and many other things that you cannot get from a single snapshot without thinking several steps ahead. In its current state, this algorithm did not do that, and in fact the game where it had the worst results, a game called Montezuma's Revenge, required much more extensive planning: going to one place and grabbing an object in order to go to another place and open a door, and the computer just couldn't make those connections. So we'll add video games to our plot here: again, more general than image classification, the stakes are higher, the task is broader, and the performance is approximately human level.
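One intuition for why long-horizon games like Montezuma's Revenge are hard for reward-driven learning: with a typical discount factor, a reward many steps in the future contributes almost nothing to the estimated value of the current action. A sketch with illustrative numbers (the discount factor of 0.9 is an assumption for the example, not from the talk):

```python
def discounted_value(reward, steps_away, gamma=0.9):
    """How much a future reward contributes to the value of acting now,
    under a standard geometric discount."""
    return reward * gamma ** steps_away

near = discounted_value(1.0, 2)    # a reward two steps away still matters
far = discounted_value(1.0, 50)    # a reward fifty steps away barely registers
print(near, far)
```

A door that only pays off fifty actions after you pick up the key produces a learning signal hundreds of times weaker than an immediate point pickup, which matches the pattern the talk describes.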
Now you may notice a pattern here: these fit roughly onto a line or curve, and we'll see this pattern continue. Looking at machine translation, taking text and changing it from one language to another: if you've ever gone to an online translator and typed a phrase, or copied a phrase from a language you weren't familiar with into one you were, you'll probably notice that the translation is surprisingly good at capturing some of the sense; even five years ago it was science fiction to be able to do this reliably. You'll probably also notice that the result is nothing like what a native
speaker would say. So okay, it's definitely going in the right direction, but right now it's far from perfect. What's really impressive to me about this is that the state of the art in language translation spans over a hundred languages, and instead of having models to translate from each language to each other language, all of these languages are translated to an intermediate representation, which can then be translated back out into any of those hundred-plus languages, so it's an all-to-all translator. The scope of that is really impressive. Now, to do this it uses long short-term memory, LSTM, which is a neural network architecture, and it actually uses several deep neural networks together: one to choose which parts of the input to carefully ignore, one to choose what to remember, one to choose what to forget, and one to choose what to pass along. There is quite a bit of computation involved, and this architecture uses several layers of those. Even so, considering the amount of effort and computing power that goes into this, if we use one of our broader performance metrics like efficiency, it takes a small hit. Also worth noting, because I mentioned attention earlier as a possible limitation of human performance: this uses an attention mechanism, and attention proves to be a really useful tool.
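The attention idea, scoring inputs for relevance and focusing on the most useful ones, can be sketched as a softmax weighting. Real translation models use learned vector queries and keys; the scores and values here are made up for illustration:

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.1, 0.1]    # the first input item looks most relevant
weights = softmax(scores)
values = [10.0, 0.0, 0.0]   # the information carried by each input item
# Mix the values by their attention weights instead of treating all inputs equally.
attended = sum(w * v for w, v in zip(weights, values))
```

Most of the output comes from the highest-scoring input, which is the pre-filtering effect described in the next sentence: the algorithm spends its capacity on what is most likely to be useful.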
When you're dealing with a lot of information, it's too much to analyze everything in depth, so by pre-filtering and focusing on what's most likely to be useful, an algorithm can be much more efficient in how it handles it. So machine translation is amazing: performance is still below human, and given the wildly ambitious scope, it's a small step forward on the generality axis and a small hit on the performance axis. Translation is still a very small part of all the things humans do, but I would definitely say this is more general than playing video games. Now looking at recommenders: if you think about your last experience with an online retailer, of the recommendations you received, probably maybe one in ten was really relevant, some of the others were near misses, and some of the others were obviously out of left field. This is still pretty good, since it's a tall order: imagine, back when there were video stores, going to a video store with your friend and trying to guess what your friend, even a friend you knew very well, would want to see on a given night.
You know it would be hard to do better than one in three or one in four, so one in ten isn't terrible. Generally speaking, it's common among these algorithms to assume that order doesn't matter, so they just look at everything you bought today, yesterday, and last year, and don't think about how these things are related, how many you could have, how many you might need, or how something you previously purchased might be related to what you might need tomorrow; they simply analyze what people bought in the past and what they bought together. They also don't accommodate the fact that your selections may change over time, so even if you bought a jar of mayonnaise a year ago, and another six months ago, and another a few months ago, it may not register the fact that your preference has changed.
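The "what people bought together" logic described above can be sketched as simple co-occurrence counting over baskets. Note that, exactly as the talk points out, it ignores order, timing, and changing preferences; the baskets are invented for illustration:

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"mayonnaise", "bread", "toilet seat"},
    {"mayonnaise", "bread"},
    {"bread", "butter"},
]

# Count every pair of items that appeared in the same basket, in both directions.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=1):
    """Recommend the k items most often bought alongside `item`."""
    paired = [(other, n) for (i, other), n in co_counts.items() if i == item]
    return [other for other, _ in sorted(paired, key=lambda t: -t[1])[:k]]

print(recommend("mayonnaise"))
```

Nothing in the counts says whether an item is a staple or a once-a-decade purchase, which is exactly how you end up being offered a second toilet seat.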
One of my favorite examples came from Twitter user Jack Raynor, who said: dear Amazon, I bought a toilet seat because I needed one, a need, not a want. I don't collect them. I'm not a toilet seat addict. No matter how temptingly you email me, I'm not going to think, oh, go on then, just one more toilet seat, I'll treat myself. So recommenders do OK compared to humans, and I'd say the knowledge of the world needed to do really well is pretty deep, so we will move it up the generality scale, but it takes a performance hit. Now we come to robots, finally something physical colliding with the world.
Driverless cars' performance is impressive. In general, the accident rates of self-driving cars, per mile driven, are lower than those of humans, and this is quite surprising when you consider all the things a car has to deal with: construction, pedestrians, cyclists, changing weather and road conditions. They are not perfect, but they are surprisingly good. Now, in terms of generality, there are a few things that make driverless cars less general than they might seem at first, and in fact the biggest trick to making them successful is to reduce the difficulty of the task and therefore reduce the necessary generality of the solution. One of the things that happens is that, especially during training, humans still have to take over in some
challenging situations: when the human gets nervous, or the car says that it doesn't know what to do, it falls back into the hands of the human. And although driving is still quite a complicated task, it is very simple compared to, for example, walking on rough terrain while eating a bagel and holding the leash of a dog that's pulling on it; there is a lot more to consider there, and it is much more difficult than a statically stable car on four wheels on a road that is flat, mostly straight, and mostly marked, and where the rules are well prescribed. To further simplify the task and narrow the scope of what needs to be learned,
the driving style of autonomous vehicles tends to be cautious. They definitely don't tend to speed. They tend not to tailgate or turn aggressively or do anything else that many human drivers do, which is absolutely good practice and should be praised and held up as a model for everyone, but what it means is that the raw driving skill required of an autonomous vehicle is generally less than that required of a human. It should also be noted that the solutions are custom designed for driving: the selection of sensors, the algorithms used to process them, the way everything is put together. It is not updated on the fly; data is collected, it is evaluated by humans, and then, very carefully and deliberately, the heuristics, the rules behind how everything is interpreted and processed, are updated, tested, and released again.
This makes sense when deploying anything with consequences as serious as a car, but from a machine learning point of view it means that the solution is actually not as general as it seems. It is very specific to a certain car with a certain set of sensors, and sometimes even to a given environment; some of the early self-driving car failures had to do with cars being used in climates they were not familiar with, for example. Until their training data spans all the conditions in which they will be deployed, their scope will be even narrower than that of human drivers. So, considering all these things about the task of driving in general, I chose to rate self-driving cars as performing below humans. Even so, there is physical interaction, and interaction with other people in other cars, so there's a lot going on, and it's definitely more complex than machine translation or even recommendations.
Now, humanoid robots are the pinnacle of cool applications. If you haven't already, go online and search for robots doing backflips and check it out. When you see something like this, it's easy to jump ahead and believe that robotics has been solved: when a robot can do physical stunts that I can't do, then, I mean, it's done, I'm ready to call it, and, yeah, it just puts a smile on my face that I can't erase. Now, in terms of generality, do another search for robots falling and you'll see a montage of really funny clips of robots trying to do very simple things, like opening a door or lifting an empty box or even just standing, and really struggling. This is because the systems are very complex, because the hardware and the sensors have a lot going on, and because most of these are deployed as research projects, most of these behaviors are quite hand-coded and quite brittle. They make a lot of assumptions about the nature of the hardware, what's going on, and the nature of the environment, and if any of these assumptions are violated, the robot's performance fails as a result. In plotting them here, the generality, in the sense of the kinds of things they have been tried on, is now becoming a non-negligible fraction of the things that humans can do; maybe it's 0.1, maybe it's 0.01, somewhere in that range, but a surprising set of things, some of which are quite difficult. The performance, though, is sometimes still ridiculously low compared to the human level. So we can compare humanoid robots as agents with humans and see that they fall far short. Our trend now is quite
clear: there's a thick line here that runs roughly parallel to the line of human-level intelligence. As solutions tend to be higher performing, they also tend to be less general, and vice versa, but it's rare that we take big steps toward the line of human intelligence. This, I think, is the key point; this is the goal of this talk. There is one example that I would like to show before we come to the conclusion, which again comes from DeepMind: a program called AlphaZero. AlphaZero is like AlphaGo, except everything it knows about Go has been removed. It doesn't know the rules of any game; it only sees visual patterns, tries actions, and learns to see what succeeds and what doesn't.
The way it was used: you can think of a brand-new instance of AlphaZero as a baby in terms of play. Two AlphaZero babies were created and they started playing against each other; one was allowed to learn and the other was not. The one that learned gradually improved a little, stumbling into some good moves by accident, until it became a decent beginner at the game. Then it was cloned; one copy learned and the other didn't, and they played and played until one became an intermediate player of the game, and this process of cloning itself and playing, with one copy learning and the other not, was repeated, using the intermediate steps as a scaffold to get better and better. It turns out that by using this approach with Go, in four hours it was as good as the best human player, and in eight hours it had beaten the previous best computer player, its uncle AlphaGo. Because it did not have any rules of the game built in,
it was also able to learn chess and beat the best current chess program, Stockfish, and learn another board game called shogi and beat the best current shogi program, all of which outperform humans by a wide margin. So this is great because it offers better performance and is more general; it is not specific to any one board game, and presumably, if there were other board games with two-dimensional grids and rulesets that weren't wildly different, it could learn to play them too. So, generality and performance: what we have now is a point that is further to the right, higher performance, and higher up, greater generality, than the original AlphaGo it came from. This is an actual increase in the area under that rectangle, an increase in intelligence. This is the direction we want to go, so it's worth taking a step back and thinking for a moment about what it is that allowed us to take a step in this direction. Well, AlphaZero made far fewer assumptions about what was happening, and it was also able to practice as many times as necessary through self-play.
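The self-play loop described above can be sketched as a toy: clone the current player, let one copy learn while the other stays fixed, and repeat. Here "skill" is just a number and the win probability is a made-up formula; the real system trains a deep network and uses tree search:

```python
import random

def play(skill_a, skill_b):
    """Return True if player A wins; the higher skill wins more often."""
    return random.random() < skill_a / (skill_a + skill_b)

random.seed(0)
learner = 1.0
for generation in range(5):
    frozen = learner                 # clone the current player; this copy does not learn
    for game in range(100):
        if play(learner, frozen):    # winning nudges the learner upward
            learner += 0.01
print(learner)
```

Each generation the frozen opponent is reset to the learner's current level, so the learner is always training against a peer that is just barely beatable, the scaffold effect described above.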
Assumptions are what limit generality and enable performance. If I exploit my knowledge of the rules of chess, I can make progress much more quickly, but it also prevents me from doing anything other than chess. If I give that up by making fewer assumptions, it takes me longer to learn something, but it means that maybe I can learn to do more things. So, some common assumptions. The first: sensor information is noise-free, we have ideal sensors. That makes sense if we are playing chess; when we sense that a piece is on a certain square, we expect that it is. But if we are dealing with, say, a self-driving car, maybe there's a mud smear on the camera, maybe the lidar calibration is a little off. We can't assume ideal sensors when interacting with the physical world; there are too many things we can't control. Another common assumption is determinism, that is, when I perform an action, I know it will have the same result every time.
That makes a lot of sense when I'm playing a board game. It makes sense if I'm classifying images: if I say that an image is an image of a cat, I know it will be labeled as a cat image, right or wrong. However, if I am a humanoid robot and I reach for a doorknob, the motor may not work the way I expect, my feet could slip on the terrain, I may have unforeseen challenges to my balance; the action may not turn out exactly as I expect, and I need to be able to adapt to that. Another very common assumption is unimodality: all sensors are of the same type. This is an assumption in convolutional neural
networks, for example: they are excellent at taking in a two-dimensional array of information that is all of the same type, where everything is a pixel or everything is a square on the board. A general solution does not get to make this assumption. Another very common assumption is stationarity: the world doesn't change; the things I learned yesterday are still true today, the things I learned five minutes ago are still valid right now. Now, we do have to make some assumptions about continuity, otherwise what I learned yesterday is of no use to me, but we also have to take into account the fact that the world changes a little. Maybe the lubrication in my ankle joint is a little low, so it will respond differently than yesterday; maybe there are clouds covering the sun, so the lighting conditions I learned to operate in yesterday have also changed, and I need to be able to adapt to that. Another common assumption is independence: the world doesn't change because of what I do to it.
Physical interaction violates this completely. If I am a robot operating in a house and I bump into a chair and slide it six inches to one side, then any map I made of that house will need to change a little; I have changed it myself. If I take a cup and move it from this table to that table, I have changed the position of that cup. The things I do change the world, and I need to keep track of that, and any algorithm I use needs to be able to take that into account. Another common assumption is ergodicity: everything I need to know in order to act, I can sense right now.
This is a common assumption, also known as the Markov assumption, but it too is commonly broken in physical interaction. For example, if I can sense position, that's great, but it doesn't tell me anything about velocity, and sometimes I need to know the velocity to know how to respond. Another very common assumption is that the effects of my actions become evident very quickly. This is not true even in chess, where the opening move can affect whether I win or lose many moves later. There are different tricks to handle this in chess, for example assigning point values to intermediate positions of pieces on the board, but in physical interaction it is much harder to know that a given set of actions I take right now is likely to result in something desirable within five minutes or within a day. All these assumptions are very common in the algorithms that are currently used and that we call AI. These algorithms are not enough to achieve human-level intelligence; these assumptions will prevent them from getting there. One thing all these assumptions have in common is that they do not hold when working with humanoid robotics, or indeed with any robot that physically interacts with the world, so my proposal is that focusing on physical interaction is a great way to force us to confront these assumptions, to discover which ones we can bend, to discover which ones we can avoid altogether, and to push us to create algorithms that are less brittle and capable of accommodating a much more general set of tasks, which will then get us one step closer
to human-level intelligence. When you hear about artificial intelligence, about half the time people are talking about convolutional neural networks. Understanding how they work is really helpful for taking a peek behind the curtain at the magic of artificial intelligence, so let's go through it in some detail. Convolutional neural networks take images and learn from them the patterns, the building blocks, that make them up. For example, in the first layer of such a network you might learn things like line segments at different angles, and then in later layers those get combined into things like faces or car parts, depending on the images you train the network with. You can combine this with reinforcement learning algorithms to get algorithms that learn to play video games, or even to control robots.
To see how they work, we'll start with a very simple example, much simpler than any of those: a convolutional neural network that looks at a very small image and decides whether it's a picture of an x or a picture of an o, just two categories. For example, the image on the left is an eight-by-eight-pixel image of an x; we want our network to classify it as an x. Similarly, for the image of the o, we want the network to classify it as an o. Now, this isn't entirely trivial, because we also want to handle cases where these inputs are shifted, different sizes, rotated, or drawn heavier or lighter, and every time we want the correct answer. A human has no problem looking at these and deciding what they are, but for a computer, deciding whether two of these things are the same is much harder.
What a computer does is go pixel by pixel. Black pixels can be minus one, white pixels can be plus one, and it compares them pixel by pixel, finding the ones that match; here the red pixels are the ones that don't match. A computer looking at this would say, no, these aren't the same; they have some matches, but they have many pixels that don't match. The way convolutional neural networks handle this, one of the tricks they use, is to match parts of the image. The parts can move around a little, but as long as the little pieces keep matching, the overall image is still considered a pretty good match. Those little pieces might look like this; we'll call them features.
You can see the one on the left looks like the diagonal arm of the x that leans to the left, and the one in the middle looks like the center of the x. The math behind finding this match by applying features is called filtering. It's pretty straightforward, but it's worth walking through how it's done. You line the feature up on the image patch you're interested in, multiply pixel by pixel, add up the values, and then divide by the total number of pixels. For example, we start with the feature for the left-tilted arm of the x, and we line it up with that arm in the image. We start with the top-left pixel and multiply the two values: one times one equals one. We can keep track of our answers here, so this pixel, when multiplied, gives a one. The top-center pixel is minus one in both the feature and the image, and minus one times minus one also equals one, so when you multiply them you get a one, which indicates a strong perfect match. We can keep doing this across the whole feature and the whole image patch, and because it's a perfect match, each of these products comes back as one. Then, to find the overall match, we simply add up these nine values, divide by the total number, which is nine, and we get a match of one.
Now we can create another array to keep track of how well the feature matched the image when placed at this position; this average value is one, so we put a one right there to record it. Now see what happens if we move the feature and line it up with a different patch. Say we move it down to the center of the x and go pixel by pixel, finding what matches. After a few pixels we find one that doesn't match: we end up with a minus one multiplied by a plus one, which gives us minus one, indicating a mismatch for those pixels. Going through the rest of the feature, we find a couple of pixels that don't match, so here, when we add everything up and divide by nine, we get a number less than one: 0.55. That indicates a partial, but not perfect, match. It turns out you can go through and do this for every possible location in the image: break it into every possible image patch, compare the feature to each one, and this is what you get. This is what convolution is: taking a feature and applying it to every possible patch of an entire image. And you can see here why it's called filtering.
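As a minimal sketch of the multiply-and-average step just described, here is the match score for one feature lined up on one patch. The 3x3 values are hypothetical stand-ins for the left-leaning diagonal-arm feature; they are not taken from the video's exact images.

```python
# Filtering step from the walkthrough above: line a feature up on an
# image patch, multiply pixel by pixel, then average the products.

def match_score(feature, patch):
    """Average of element-wise products; 1.0 means a perfect match."""
    total, count = 0.0, 0
    for f_row, p_row in zip(feature, patch):
        for f, p in zip(f_row, p_row):
            total += f * p
            count += 1
    return total / count

diagonal_arm = [[ 1, -1, -1],
                [-1,  1, -1],
                [-1, -1,  1]]

# A patch identical to the feature scores a perfect 1.
print(match_score(diagonal_arm, diagonal_arm))        # 1.0

# Flip two pixels and the score drops to 5/9, roughly the 0.55
# partial match described in the walkthrough.
partial = [[ 1, -1, -1],
           [-1, -1, -1],   # center pixel flipped
           [-1, -1, -1]]   # bottom-right pixel flipped
print(round(match_score(diagonal_arm, partial), 2))   # 0.56
```

Each mismatched pixel contributes a minus one instead of a plus one, which is why the score steps down in increments of 2/9 here.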
What we have is a map of where this feature matches the image: you can see a strong cluster of pluses along the diagonal line from the bottom right to the top left, and smaller values everywhere else. It's a filtered version of the original image, showing where the feature matches. We can represent this with a little convolution operator, shorthand notation we've just invented, and we can do the same with our other features. We can see where our x-center feature matches; not surprisingly, it matches most strongly at the center of the image.
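Sliding the feature across every patch of the image can be sketched in a few lines. The image and feature below are small, made-up examples, not the eight-by-eight x from the video; the point is the sweep itself.

```python
# Convolution as described above: slide a feature over every possible
# patch of an image and record the match score at each position.

def match_score(feature, patch):
    vals = [f * p for f_row, p_row in zip(feature, patch)
                  for f, p in zip(f_row, p_row)]
    return sum(vals) / len(vals)

def convolve(image, feature):
    fh, fw = len(feature), len(feature[0])
    out = []
    for r in range(len(image) - fh + 1):
        row = []
        for c in range(len(image[0]) - fw + 1):
            patch = [img_row[c:c + fw] for img_row in image[r:r + fh]]
            row.append(match_score(feature, patch))
        out.append(row)
    return out

# A toy 4x4 image with a diagonal line, and a 2x2 diagonal feature.
image = [[ 1, -1, -1, -1],
         [-1,  1, -1, -1],
         [-1, -1,  1, -1],
         [-1, -1, -1,  1]]
feature = [[ 1, -1],
           [-1,  1]]

filtered = convolve(image, feature)
# The strongest responses (1.0) lie along the diagonal of the map.
print(filtered[0][0], filtered[1][1], filtered[2][2])  # 1.0 1.0 1.0
```

The output is the "filtered image" the transcript describes: a map of match strength at every position, with the diagonal lighting up.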
We can see where our right-tilted arm matches, and not surprisingly, it matches along the diagonal from the bottom left to the top right. Now we have three filtered versions of the original image. This is what a convolutional layer does in a convolutional neural network: it has a set of features, and it can be three or thirty or three hundred or three thousand, but it has a set of features, and it takes the original image and returns a set of filtered images, one for each feature. This is how we'll represent it. That's the number one ingredient in convolutional neural networks, the special magic sauce, the special trick that lets the algorithm take an inexact match and still handle it.
It's not a perfect match, but it's still a pretty good match, because the convolution moves the feature across the image and finds everywhere it could match. Another part of this is called pooling. We take our original image, and now we have a stack of images; what this step does is shrink them down a little. We start by choosing a window size, usually two or three pixels; two pixels has been shown to work well. Then we step this window across the filtered images, and from each window we take the maximum value it sees, which is why this is called max pooling. To see how this works, we start with one of our filtered images. We have our window, two pixels by two pixels, and inside it the maximum value is one, so we create another small array to keep track of our results and put a 1 in it. Then we move it over by our stride, which is 2 pixels, look at the window, pick the maximum value, in this case 0.33, record it, and go again. We keep doing this, recording the maximum value each time, across the whole image, and when we're done we have what, if you squint, looks like a scaled-down version of the original. We still have the strong set of pluses on the diagonal, and everywhere else is less than that, so it keeps the general pattern of the original signal but shrinks it, picking out the high points. This gives us a smaller image that's still similar to the original, and we can represent it with a little shrinking arrow. We can do this with each of our filtered images, and again you'll see that, very roughly, the pattern of the original is preserved. So a pooling layer converts a stack of images into a stack of smaller images.
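The window-and-maximum step above can be sketched directly. The filtered image here is a small hypothetical one with the same kind of diagonal pattern discussed in the text.

```python
# Max pooling as described: step a small window across a filtered image
# and keep only the maximum value inside each window.

def max_pool(image, size=2, stride=2):
    out = []
    for r in range(0, len(image) - size + 1, stride):
        row = []
        for c in range(0, len(image[0]) - size + 1, stride):
            window = [image[rr][cc]
                      for rr in range(r, r + size)
                      for cc in range(c, c + size)]
            row.append(max(window))
        out.append(row)
    return out

filtered = [[1.00, 0.33, 0.11, 0.33],
            [0.33, 1.00, 0.33, 0.11],
            [0.11, 0.33, 1.00, 0.33],
            [0.33, 0.11, 0.33, 1.00]]

pooled = max_pool(filtered)
print(pooled)  # [[1.0, 0.33], [0.33, 1.0]]
```

The 4x4 map shrinks to 2x2, but the strong diagonal survives, which is exactly the "smaller image that's still similar to the original" the transcript describes.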
The last ingredient we need is normalization. This keeps the math from blowing up by adjusting these values just a little bit: it takes everything negative and changes it to zero, which prevents things from becoming unmanageable as you move through later layers. The function that does this is called a rectified linear unit, a fancy name for something that simply takes everything negative and turns it into zero. A 0.77 isn't negative, so it's left alone, but a minus 0.11 is negative, so it gets brought up to zero. When you've gone through all your images and all your pixels and done this, this is what you have: everything that was negative is now zero. So it's just a little bit of normalization, some conditioning to keep things numerically well behaved.
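The rectified linear unit is simple enough to write in one line, applied here to each pixel of a small made-up image:

```python
# The rectified linear unit from the normalization step: anything
# negative becomes zero, everything else passes through unchanged.

def relu(x):
    return max(0.0, x)

def rectify(image):
    return [[relu(v) for v in row] for row in image]

print(relu(0.77))    # 0.77  (not negative, left alone)
print(relu(-0.11))   # 0.0   (negative, brought up to zero)
print(rectify([[0.77, -0.11], [-1.0, 0.33]]))  # [[0.77, 0.0], [0.0, 0.33]]
```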
A stack of images is converted into a stack of images with no negative values. Now, you might notice that the output of one layer looks like the input of the next: they're all arrays of numbers. An image and an array of numbers are the same thing; they're interchangeable. So you can take the output of the convolution layer, feed it through the rectified linear unit layer, feed that through the pooling layer, and when you're done you have something that's had all these operations applied to it. And you can do this over and over again: you can imagine stacking this recipe up like a Scooby-Doo sandwich, all these different layers repeated in different orders. Some of the most successful convolutional neural networks are groupings of these layers that were discovered, partly by accident, to work very well, so they get used again and again. Over time, each convolution layer filters for a set of features, each rectified linear unit layer changes everything to be non-negative, and each pooling layer shrinks things down, so when you're done you end up with a tall stack of filtered images with no negative values that have been reduced in size.
After several iterations of this, we take the result and run it through a fully connected layer. This is more of a standard neural network, where every input connects to everything in the next layer with a weight. You can think of it as a voting process: each pixel value left in these filtered, reduced images gets a vote on what the answer should be, and that vote depends on how strongly the pixel tends to predict an x or an o. When this pixel is high, is the output usually an x, or usually an o? For this particular input, the input was an x, and these are the convolved and filtered image values; over time we would learn that the values that are high when you see an x get a strong vote for the x category. These lines represent the weights, the strength of the vote between these pixels and these answers. Now, if we get a new input we've never seen before, these might be the final pixel values. We can use those votes and run a weighted voting process, adding them up, and in this case we get a total of 0.92 for x and a total of 0.51 for o. Since 0.92 is obviously more than 0.51, we declare x the winner, and this input gets categorized as an x. So that's a fully connected layer: it takes a list of feature values, in this case our filtered, reduced pixels, and turns it into a list of votes for each of our output categories
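The voting process can be sketched as a weighted sum per category. The pixel values and weights below are made up in the spirit of the 0.92-versus-0.51 example; they are not the numbers from the video.

```python
# The fully connected layer as a weighted vote: each reduced pixel
# value votes for each category with a learned weight, and the
# category with the biggest total wins.

def vote(pixels, weights):
    """weights maps each category name to one weight per pixel value."""
    return {category: sum(p * w for p, w in zip(pixels, ws))
            for category, ws in weights.items()}

pixels = [0.9, 0.65, 0.45, 0.87]           # hypothetical reduced values
weights = {"x": [0.6, 0.2, 0.1, 0.3],      # hypothetical learned weights
           "o": [0.1, 0.3, 0.5, 0.1]}

totals = vote(pixels, weights)
winner = max(totals, key=totals.get)
print(winner)  # x
```

In a trained network these weights come from the learning process discussed later; here they are fixed by hand just to show the mechanics.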
and then those vote in the next layer, and so on, until we get to the final answers; all of these accumulate into the final output. We'll get into that in just a second. Now, to get into the next level of detail about these neural networks, let's set our x and o detector aside for a while. We had eight-by-eight-pixel images, 64 pixels in total; now consider a two-by-two-pixel image, just a four-pixel camera. What we'd like to do is categorize the images it takes as a completely solid light or completely dark image, a vertical image, a diagonal image, or a horizontal image. The trick here is that simple rules can't do it. Both of these are horizontal images, but their pixel values are completely opposite, so I can't say that if the top-left pixel is white and the top-right pixel is white then it must be horizontal, because that violates the other one. Of course, you could write more complicated rules to handle this; the point is that when you go to larger images, you can't create simple rules that capture all the cases you want. So how do we do it?
We take these four input pixels and break them out. We call them input neurons, but they just take these pixels and turn them into a list of numbers. The numbers correspond to brightness: minus one is black, plus one is white, zero is medium gray, and everything else is in between. So this takes the small image and converts it into a list of numbers, which is our input vector. Now, each of these can be thought of as having a receptive field: the image that makes the value of this input as high as possible. If you look at our top input neuron, the image that makes that number as high as possible is one where the top-left pixel is white; it doesn't care what the other pixels are, which is why they're shown checkered. You can see that each input neuron has its own corresponding receptive field, the image that drives its value as high as it can go. Now we're going to build a neuron. When people talk about artificial neural networks and neurons, this is what they mean, and we'll build it piece by piece.
The first thing you do to build a neuron is take all these inputs and add them up. In this case, this is what we'd get, so the value of the neuron at this point is 0.5. The next thing we do is add weights. We mentioned the weighted voting process earlier; what it looks like is that each of these inputs is assigned a weight between plus and minus one, and its value gets multiplied by that weight before being added in. So now we have a weighted sum of these input neurons. We'll represent it visually by showing positive weights in white and negative weights in black, with the thickness of the line roughly proportional to the magnitude of the weight; when the weight is zero, we'll omit the line to minimize visual clutter. Now that we have a weighted sum of the inputs, the next thing we need to do is squash the result.
Since we're going to do this a lot, it's good if we always guarantee the answer is between plus and minus one after each step, which keeps it from growing without bound numerically. A very convenient function for this is this S-shaped sigmoid squashing function. This particular one is called a hyperbolic tangent; there's another one, confusingly also called a sigmoid, that's a little different but has the same general shape. The nice feature of this function is that you can take your input, draw a vertical line, see where it intersects the curve, follow that across with a horizontal line to the y axis, and read off the squashed version of your number. In this case, 0.5 comes out a little less than 0.5, 0.65 comes out at about 0.6, and as you go up this curve you can see that no matter how big your number is, what comes out will never be greater than one, and similarly it will never be less than negative one. So you take this infinitely long number line and squash it so that everything falls between plus and minus one. We apply this function to the output of our weighted sum and get our final answer. This weighted sum and squash is almost always what people mean when they talk about an artificial neuron.
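The weighted-sum-and-squash neuron just described fits in a few lines. The pixel values and weights here are illustrative, using the plus-one, minus-one, zero weights of the simplified picture.

```python
import math

# An artificial neuron as built up above: a weighted sum of the inputs,
# squashed through the hyperbolic tangent so the output always lands
# between -1 and 1.

def neuron(inputs, weights):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(weighted_sum)

pixels = [1.0, -1.0, 0.5, 0.0]    # four-pixel input vector
weights = [1.0, -1.0, 0.0, 0.0]   # +1 / -1 / 0 weights from the picture

print(round(neuron(pixels, weights), 3))  # tanh(2.0), about 0.964

# However large the weighted sum gets, the squashed output stays below 1.
print(math.tanh(100.0))  # 1.0
```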
We don't have to do this just once; we can do it as many times as we want with different weights, and this collection of weighted-sum-and-squash neurons can be thought of as a layer, loosely inspired by the biological layers of neurons in the human cortex. Each of these has a different set of weights. To keep our picture really simple, we're going to assume that each weight is either plus one, a white line; minus one, a black line; or zero, omitted entirely. So now we have our layer of neurons, and we can see that the receptive fields have become more complex.
If you look at the neuron at the top of the first layer, you can see how it combines the inputs from the top-left pixel and the bottom-left pixel. Both weights are positive, the lines are white, and what comes out as its receptive field is that if both pixels on the left are white, it takes the highest value it can have. If we look at that layer of neurons and take the bottom one, we can see that it takes its inputs from both pixels on the right, but it has a negative weight connecting it to the bottom-right pixel, so its receptive field, what activates it most strongly, is a white pixel at the top right and a black pixel at the bottom right. Now we can repeat this, because the outputs of that first layer of neurons look a lot like our input layer: they're still a list of numbers between -1 and 1. So we can add additional layers, and we can do this as many times as we want, with each neuron in one layer connected to each neuron in the next layer by some weight. In this case, you can see how the receptive fields become even more complex, and we're starting to see patterns that look like the things we're interested in: solids, verticals, diagonals, horizontals, and combinations of those elements. There's one more thing we can do. Remember our rectified linear unit: we can have different neurons here. Instead of a weighted sum and squash, we can have something that takes the input and spits out 0 if it's negative and the original value if it's positive. So, for example, if we have an input whose receptive field is the one at the top of the second layer, completely solid
white, and we connect it with a positive weight to the rectified linear unit neuron at the top, then of course what maximizes that neuron is an all-solid-white input. But if we look at the neuron just below, which is connected with a negative weight, that flips everything, and what activates it most strongly is an input that is completely black.
Now we're really starting to get the set of patterns we can imagine using to decide what our images are, so we connect them up to a final output layer. This output layer is the list of all the possible answers we hope to get from our classifier. Originally they were x and o; now there are four categories: solid, vertical, diagonal, and horizontal. Each of these inputs gets a vote, but you can see that very few of them are connected; this network assumes that most of those votes are zero. To see how this plays out, let's say we start with an input like the one on the left; this is obviously a horizontal image, with a dark bar on top and a white bar on the bottom. We propagate it to the input layer and then to the first hidden layer, and you can see, for example, the neuron at the top, which combines two input neurons, one light and one dark. You can imagine it adding plus one and minus one and getting a sum of zero, which is why it's gray.
Its value is zero. Now, if you look at the neuron at the bottom of that first hidden layer, you can see that it also adds one input that's negative and one that's positive, but it's connected to one by a negative weight and to the other by a positive weight, so both contributions come out as minus one. What you get is the opposite of its receptive field, which means it's activated maximally, but negatively; that's why it's black. We move on to the next layer, and you can trace these things through: zero plus zero gives you zero. If you look at the neuron at the bottom of this second hidden layer, you can see it's adding a negative and a negative, both connected by positive weights, so it will also be negative. That makes sense, because its receptive field is exactly the opposite of the current input, so it's fully on, just negatively. Then, when we trace this to the next layer, you can see that, following that bottom pair of neurons, the negative value goes through a rectified linear unit and becomes zero, so it's gray; but the bottom neuron is connected with a negative weight, so its input becomes positive, the rectified linear unit likes that and passes it through at full strength. So everything is zero except that neuron at the bottom, and finally, that means the only non-zero output is the horizontal one. This network classifies the input image as horizontal.
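Tracing an input through layers like this can be sketched as repeated matrix-style sums with different activations. The weights below are hypothetical, chosen only so that a horizontal input lights up one output; they are not the exact weights in the video's diagram.

```python
import math

# Forward pass through a small network in the style described above:
# a tanh layer, then a rectified linear layer, then output votes.

def layer(inputs, weight_rows, activation):
    """One weight row per output neuron; weighted sum then activation."""
    return [activation(sum(x * w for x, w in zip(inputs, row)))
            for row in weight_rows]

tanh = math.tanh
relu = lambda v: max(0.0, v)
identity = lambda v: v

# Horizontal image: dark bar on top (-1, -1), white bar below (+1, +1).
pixels = [-1.0, -1.0, 1.0, 1.0]

hidden = layer(pixels, [[ 1, -1, 0, 0],
                        [ 0,  0, 1, -1],
                        [-1,  0, 0, 1]], tanh)     # [0.0, 0.0, tanh(2)]
rectified = layer(hidden, [[0, 0, 1],
                           [0, 0, -1]], relu)      # negative side zeroed
votes = layer(rectified, [[1, 0],
                          [0, 1]], identity)

print(votes)  # only the first output is non-zero, so category 0 wins
```

As in the walkthrough, a negatively weighted connection into a rectified linear unit gets clipped to zero, so only one output survives.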
Now, there's some magic here. Where did we get those weights? Where did we get the intermediate features? This is where we start to get into the topic of learning. Machine learning is all about optimization: these weights are learned from a lot of examples over time. We'll set that aside for just a moment and come back to how they're learned; first we need to talk about optimization. Consider drinking tea. There's a temperature range where it's a delicious experience: warm, tasty, and comforting.
If your tea is much hotter than that, it's painful; it's no good, no fun at all. And if your tea is colder than that, it's lukewarm and really mediocre, really not worth it. So this region at the top is the peak; this is the best, this is what we're trying to find. In optimization we're just trying to find the best experience, the best performance. Now, if we want to find that mathematically, the first thing we do is flip the curve upside down, just because that's how optimization problems are conventionally formulated, but it's the same kind of thing.
Instead of maximizing the pleasure of drinking tea, we minimize the suffering of drinking tea: we want to find the bottom of that valley, the point of least suffering. There are a few different ways to do this. The first is to look at every point on this curve and simply choose the lowest one. The trick is that we don't actually know what this curve is beforehand, so to choose the lowest point we have to do an exhaustive search, which in this case means making a cup of tea, asking someone to drink it and say how much they like it, then making another and asking again, over and over for every possible temperature, and finally choosing the one with the least suffering, the one they enjoy most. This is effective, very effective, but it can also be very time-consuming for a lot of problems, so we look for a shortcut. Because this is a valley, we can use our physical intuition and say: what if we just had a marble and let it roll to the bottom of this valley? Then we wouldn't have to explore every point. This is what's behind gradient descent. The way it works is that we start out knowing nothing about the function. We make a cup of tea; someone tells us how much they like it. Then we change the temperature: we make another cup a little colder, ask someone how they like it, and find out they like it a little less. That tells us which direction we should go: we should make our next cup of tea warmer. And the difference between how much they like those two cups gives us the slope, which gives us an idea of how much hotter we should make the next cup. So we make another one and repeat the process, sliding a little each time, making another cup of tea, and figuring out again which direction we should go.
Should we heat it up to make a better cup, or cool it down? We repeat this until we reach the bottom of the curve. You'll know you're at the bottom when you change the temperature a little and the tea drinker says yes, it's exactly the same, I like it as much as the last one: that means you're on a flat valley bottom. So gradient descent is the first-rate trick for making fewer cups of tea. There's another thing you can do, which is to use curvature. This is a more advanced method in which you make your original cup of tea and then make one a little warmer and one a little colder, and see how the curve of your function bends. If it's very steep and getting steeper, you know you can take a giant step, because you're probably not close to the bottom; then you do it again, and if that curvature is starting to flatten out, you can take a smaller step, because that's a sign you're getting close to the bottom. This gets you there in fewer steps, as long as your curve is relatively well behaved, which is not always the case. There are ways this can break. Imagine we're doing this on a hot day, and it turns out that if we cooled our tea way down we'd get really good iced tea, which turns out to be even more popular with our tea drinkers. Gradient descent would never find this: gradient descent always rolls to the nearest valley bottom; it doesn't jump around to see if there are valleys hidden somewhere else.
Another problem: let's say there's noise on our curve, something happening in the environment, noisy buses passing by, and it affects how people enjoy their tea. We might not be able to find the true lowest dip, because we can get stuck in a dip further up the curve. Similarly, if we ask our tea drinkers to rate their experience on a scale of one to ten, we get discrete jumps in our function, and a marble rolling downhill doesn't handle that well; it can get stuck on a step without ever reaching the bottom.
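The basic recipe above, brew a cup, brew a slightly different one, estimate the slope from the difference, and step downhill, can be sketched directly. The suffering function here is made up for illustration: a simple parabola with its minimum at 58 degrees.

```python
# Gradient descent on a hypothetical tea-suffering curve: estimate the
# slope from two nearby "cups of tea" and step in the downhill direction.

def suffering(temp):
    """Made-up suffering curve: least suffering at 58 degrees C."""
    return (temp - 58.0) ** 2

def gradient_descent(start, step_size=0.1, delta=0.01, iterations=200):
    temp = start
    for _ in range(iterations):
        # Two nearby cups give an estimate of the slope at this point.
        slope = (suffering(temp + delta) - suffering(temp)) / delta
        temp -= step_size * slope   # move downhill
    return temp

best = gradient_descent(start=80.0)
print(round(best, 1))  # 58.0
```

Starting at 80 degrees, the steps shrink as the slope flattens, settling on the valley bottom. On a curve with a second, deeper valley, this same loop would stop in whichever valley it started nearest, which is exactly the iced-tea failure described above.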
Now, all of these things happen in real machine learning problems. One more: imagine you have very picky tea drinkers, and if the tea isn't perfect, they hate it. Then you have these plateaus on either side, and there's no signal to tell you that if you move a little you'll find the deep valley. For cases like this, of course, we can always fall back on exhaustive exploration. It will find the best answer in every one of those cases, but often we just don't have the time: you'd have to brew, and measure the pleasure of drinking, 10 million cups of tea to get a good answer.
That's not going to happen in my lifetime, so luckily there are some methods in between that are more sample-efficient than exhaustive search but a little more robust than gradient descent, like genetic algorithms and simulated annealing. Their defining characteristic is that they include a bit of random jumping; they're a bit unpredictable, which makes it harder to miss things. Each has its strengths and weaknesses; they tend to be good for different types of problems, or different types of pathologies in the loss function, but they all help avoid getting stuck in local minima, the little valleys that gradient descent gets stuck in. They get away with making fewer assumptions, and they may take a little longer to compute than gradient descent, but not as much as an exhaustive search.
You can think of gradient descent like a Formula One race car: if you have a really nice, well-behaved track, it's fast, but put a speed bump on the track and you're done. Simulated annealing, genetic algorithms, evolutionary algorithms: those are like a four-wheel-drive pickup truck. You can take a pretty rough road with them and get where you're going; maybe not in record time, but you'll get there. And exhaustive exploration is like traveling on foot: there's nothing stopping you from getting anywhere, you can go literally anywhere, but it may take you a very long time. To illustrate how this works, imagine we have a model we'd like to optimize.
We have a research question: how many M&Ms are there in a bag of M&Ms? Answering this seems easy: you buy a bag of M&Ms, you eat it, and you count 53 M&Ms. Great, we know how many were in the first bag. Now, when I did this I made the mistake of buying another bag and trying that one, and I got a different answer. So now I can answer 53, or I can answer 57; either way I'm only right half the time, because I can't capture both bags with one answer. I could answer somewhere in the middle, but that's never right either:
I've never opened a bag that had 55 M&Ms, so it's not clear that's the right answer either, and the situation doesn't get better with more bags of M&Ms eaten; it just gets more out of hand. So I change my goal from answering the question correctly to answering it in a way that is least wrong. To do that, I have to be really specific about what I mean by how wrong I am, and for that I have a distance function, a deviation, which is the difference between my guess and the actual number of M&Ms in a bag. So for bag number i, the deviation d sub i is just the difference between the actual count and the guess. Then I have to convert this deviation into a cost. A common way to do this is to square it: the further apart things get, the more expensive they become, and the cost goes up faster and faster. If one bag is twice as far away as another, it costs four times as much, so squaring strongly penalizes things that are far away; things that are close
we don't penalize as much. And if we don't want to penalize things that are way out there too heavily, we could use the absolute value of the deviation instead: something twice as far away just costs twice as much. Really, we could use almost anything: the square root of the absolute value, ten to the power of the absolute value of the deviation, anything that increases as you get further from zero. We'll stick with the squared deviation; it's very common, it has some interesting properties, and it makes a good example. So, for the total cost of whatever guess we make: if I guess that there are n M&Ms in a bag, then the loss function, this fancy curly L of that guess, is just the sum of the squared deviations associated with each bag of M&Ms, d 1 squared through d m squared. Each deviation is the number of M&Ms in that bag minus the guess, squared, and we can write the whole thing with summation notation like this. So this is my loss function, the total cost, how wrong I am when I make a guess n. Since we have computers, you can write a little code and do an exhaustive exploration: for every guess between 40 and 70, how wrong would I be with this data? You can plot it, look at it visually, and say, hey, there's the lowest value. We can ask which value of the guess gives me the lowest loss; that's what that argmin notation means right there, and this best guess is about 55 and a half M&Ms.
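The exhaustive sweep just described fits in a few lines of code. The first two bag counts come from the story; the last three are hypothetical extras added so the sweep has a bit more data.

```python
# Exhaustive search over guesses: for each candidate guess, compute the
# loss (sum of squared deviations from each bag's count) and keep the
# guess with the lowest loss.

counts = [53, 57, 54, 58, 55]   # M&Ms per bag (last three are made up)

def loss(guess):
    return sum((c - guess) ** 2 for c in counts)

# Sweep candidate guesses from 40 to 70 in steps of 0.5.
candidates = [40 + 0.5 * i for i in range(61)]
best_guess = min(candidates, key=loss)
print(best_guess)  # 55.5
```

Plotting `loss` over the candidates would give the bowl-shaped curve described in the text, with its minimum at the best guess.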
Problem solved. This is an example of numerical optimization, where we compute the loss function and then, essentially because it's simulated, we can do an exhaustive search and simply choose the lowest value. Now, for this particular example there's another fun way to find it. We know that at the bottom of this curve the slope is 0; it's the only place on the entire curve where it's flat. We can use a little calculus to find that point; feel free to tune out if calculus isn't your thing, but it's not that bad. We find the slope of the loss function with respect to our guess, set it equal to 0, and solve for the guess that makes it true. We take our loss function, this sum of the squared differences between the counts and the guess, and take its derivative with respect to the guess. The derivative of a sum is the same as the sum of the derivatives, and taking the derivative of each squared term just brings the exponent down, so we get two times each deviation, summed, all equal to zero. We can divide by two and it's still true, so now the sum of our deviations is 0.
To simplify further, that's the sum of all the actual bag counts minus our guess once for each bag; if we have m bags, that's m times our guess. We can move that to the other side of the equals sign, divide both sides by the number of bags m, and what we get is that our best guess is the total number of M&Ms we found across all the bags divided by the number of bags, or the average count per bag. This is a really neat result, and it's things like this that get people so excited about optimization: with a little math and calculus you can get this nice theoretical answer.
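A quick numerical check of that derivation, again with made-up bag counts: the per-bag average really does beat any nearby guess.

```python
# Check that the mean minimizes the sum-of-squared-deviations loss.
# The bag counts are invented example data.
counts = [52, 57, 55, 54, 58, 56, 53, 59]
mean = sum(counts) / len(counts)   # the analytical best guess

def loss(guess):
    return sum((c - guess) ** 2 for c in counts)

# Nudging the guess away from the mean in either direction raises the loss.
at_mean = loss(mean)
nudged_up = loss(mean + 0.1)
nudged_down = loss(mean - 0.1)
```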
Now, it's worth noting that this is only true if you use the squared deviation as your cost function. That's one of the reasons people like it so much: it tends to give nice results like this, and there's an analytical shortcut for finding the best guess. We'll come back to this in a few minutes. Now, how does optimization apply to our neural network? How do we use it to find these weights and these features? We know what our error function is: how wrong our guesses are.
In this case we have a labeled data set, which means a human has already looked at this input on the left and said, hey, that's a horizontal image. The truth values are what we know the correct answer should be: zero votes for everything except horizontal, which should have a vote of one. Say that initially we have a neural network in which all the weights are random, and it gives us meaningless results: everything has some number associated with it, but it's nothing like the correct answer. Well, we can find the error for each category, add them up, and get a total error, and that's how wrong our neural network is for this example. Here's our loss, here's our error. Now, the idea with gradient descent is that we're not just adjusting one thing, like our guess at the number of M&Ms; we're adjusting a lot of things. We want to go through and adjust every weight in every layer to reduce this error a little bit. That's a little difficult to do, because unless you can find an analytical solution like we did before, the way to find the slope is to nudge each weight a little up and a little down and see what happens, and that is really expensive. This is not a one-dimensional problem; you could have hundreds or millions of different weights to adjust, so calculating that gradient numerically would require hundreds or millions of extra passes through the neural network just to figure out which direction is downhill. Enter backpropagation. Remember that we found a nice analytical solution in the M&M estimation case, so we would love to be able to do something like that again.
If we had an analytical solution, we could jump directly to the right adjustment. The slope in this case is the change in error for a given change in weight. There are many ways to write that: delta error over delta weight, dE/dw; we'll use the partial derivative of the error with respect to the weight, just because it's more correct, but all of these mean the same thing: if I change the weight by one, how much does the error change? What is the slope? In this case it would be minus two, and we would know that we need to increase the weight to get closer to the bottom.
This not only tells us the direction we should move, it gives us a sense of how far we should go. It doesn't tell us exactly where the bottom is, but it tells us which way to adjust. Now, if we know the error function, as in this example, we can find an analytical solution and calculate that slope exactly. In this case the change in error for a given change in weight is just the derivative of our error function, which here is the weight squared; the derivative is two times the weight, the weight is negative one, so the slope is negative two, and that tells us what we need to know about which way to adjust.
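As a minimal sketch of using that slope, here is gradient descent on the error function E = w², starting from the weight of negative one in the example; the learning rate is an arbitrary choice for illustration.

```python
def error(w):
    return w ** 2        # the error function from the example

def slope(w):
    return 2 * w         # its derivative, dE/dw

w = -1.0                 # start at weight -1, where the slope is -2
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * slope(w)   # step against the slope, downhill

# w has moved from -1 toward the minimum at 0
```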
Now with neural networks, of course, things are more complex than that, but we can still analytically calculate the slope of the function where we are. We don't know where the minimum is, but we can find the slope without having to recalculate the value of everything every time, and here's how it works. Imagine the most trivial neural network in the world: one input, one output, one hidden layer with one neuron. An input x is connected by a weight w1 to an intermediate value y, which is connected by a weight w2 to an output value e. The intermediate value y is just x times the weight w1, so the derivative of y with respect to w1 is x. What that means is that if I change w1 by one, then the value of y changes by the value of x, whatever x is; we have the slope of this piece of the function. Similarly, we can read off that whatever the value of y is, multiplying it by the weight w2 gives us e, so if we want the slope of the error for a given change in y, the answer is w2: if I change y by one, the error changes by the amount w2. Now, chaining means we can take these two things and just multiply them: by inspection we can see that in this small neural network, if we take x, we multiply it by w1.
Multiplying that by w2 gives us the error, and now what we'd like to know is: if I change w1 by some amount, how much does the error change? In this case we just take that whole expression and take the derivative with respect to w1, and a bit of fairly trivial calculus shows that it's x multiplied by w2. What we can see, then, is that the chain can be broken into steps: if we want to work our way along the chain, we want to know how much a change in w1 affects the error.
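The chain rule on this trivial network is easy to verify numerically; the values of x, w1, and w2 below are arbitrary.

```python
# Check de/dw1 = (dy/dw1) * (de/dy) = x * w2 on the one-neuron network.
x, w1, w2 = 0.5, -1.0, 2.0

def forward(w1):
    y = x * w1     # intermediate value
    e = y * w2     # output value
    return e

analytic = x * w2  # slope predicted by chaining the two derivatives

# Numerical slope: nudge w1 slightly and see how much the output changes.
h = 1e-6
numeric = (forward(w1 + h) - forward(w1)) / h
```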
What is dE/dw1? We can break it down into steps and say: okay, if I change w1, how much does y change? And then if I change y, how much does the error change? This is chaining, and it's what lets us figure out which direction we want the error to move and how much we can change this weight to help make that happen. And there's nothing stopping us from doing this over and over again: if I have a weight deep in my neural network and I want to know how much my error will change if I nudge it up or down,
I want to know the slope of my loss function with respect to that weight. I can decompose the problem and say: okay, if I change the weight, how much does a change? If I change a, how much does b change? If I change b, how much does c change? And so on, chaining it all the way through. This is called backpropagation, because to calculate it we actually need the value at the end: we have to start with the error and work backward through the network, but we can still do it. To see why you have to go backward, let's say we want to know how a change deep in the network affects the error.
If I change a, how much does the error change? Well, assume I already know how much the error changes if I change b. What is this backpropagation step, the extra link I need to add to the chain? It's: how much does b change if I change a? If the two are connected by a weight, how do I incorporate that weight? We know that two neurons connected in this way are represented by b equals the weight times the value of a, so we can take a little derivative here and get that the change in b with respect to a is w. So this backpropagation step can be represented by just the weight. Now, we know we have sums in our neural network; that's another thing we have to deal with. If I know how much my error changes with a change in z, then how much does it change with a change in one of the inputs to z, where z is a sum? Well, I can write the expression for z, adding up all of its inputs, and if I want to know how much z changes with respect to a change in a, I just take the derivative, and it turns out to be one. So this is a trivial backpropagation step.
Now for the most interesting one of all. If I know how much the error changes with respect to a change in b, and I want to know how much it changes with the input a to that sigmoid function, then I can say: okay, a sigmoid function mathematically looks like this, and I can take the derivative of b with respect to a. One of the beautiful things about the sigmoid function is that its derivative is just the value of the function multiplied by one minus the value of the function, which is one of the reasons sigmoids are so popular in deep neural networks. So this step too is simple to calculate, and in none of these steps have we had to recalculate all the values in the neural network.
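That handy sigmoid property can be checked numerically as well:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# The derivative expressed in terms of the function's own value:
# sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
a = 0.7                      # an arbitrary input point
b = sigmoid(a)
analytic = b * (1.0 - b)

# Finite-difference slope at the same point for comparison.
h = 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
```

Note that computing the slope reuses `b`, a value the forward pass already produced, which is exactly the efficiency point being made here.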
We have been able to rely on things that were already calculated: the values in each of these neurons. That's what makes backpropagation computationally efficient, and it's what lets us train neural networks efficiently. It's also why every element in a neural network, no matter how exotic, must be differentiable, so that we can perform this exercise of finding the link in the chain, apply the chain rule to our derivatives, calculate the backpropagation step, and propagate the error backward. The same goes for rectified linear units: if we know how much a change in the output affects the error, and we want to know how that extends back to the input,
we can write the function of a rectified linear unit, take its derivative, and then use it in our chain rule. So imagine now we have this labeled example. We calculate the answer; this random neural network, which is nothing special, gives an answer that is completely wrong. Then we propagate the error back and adjust each of those weights a little bit in the right direction, and do it again and again. After a few thousand iterations, this stochastic gradient descent takes that totally random fully connected neural network to something much better, able to give answers much closer to the correct one. Going back to our convolutional neural networks: these are the fully connected layers, and this is how they are trained; they can also be stacked. Backpropagation applies not only to the fully connected layers but also to the convolutional layers and pooling layers. We won't work through the chain rule for them, but you can do it too. By running this whole stack of different layers over a bunch of labeled examples, the features for each convolutional layer are learned as well: you learn not only the weights but also the features, and over time those representations become something that lets you predict very well what is an X and what is an O.
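Putting the chain-rule links together, here is a toy sketch of that training loop on the one-neuron network from earlier, with a sigmoid in the middle; the training pair, random seed, and learning rate are all made up for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One made-up training example: input x should map to target t.
x, t = 1.0, 0.25

random.seed(0)                 # fixed seed, just for reproducibility
w1 = random.uniform(-1, 1)     # input -> hidden weight, random start
w2 = random.uniform(-1, 1)     # hidden -> output weight, random start
rate = 0.5

for _ in range(2000):
    # Forward pass
    y = sigmoid(w1 * x)        # hidden activation
    out = w2 * y               # network output

    # Backward pass: one chain-rule link per line
    d_err_d_out = 2 * (out - t)        # derivative of the squared error
    d_out_d_w2 = y
    d_out_d_y = w2
    d_y_d_w1 = y * (1 - y) * x         # sigmoid slope times the input

    # Nudge each weight a little bit downhill
    w2 -= rate * d_err_d_out * d_out_d_w2
    w1 -= rate * d_err_d_out * d_out_d_y * d_y_d_w1

final = w2 * sigmoid(w1 * x)   # output after training, close to the target
```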
Beyond that, there are other things we can use optimization for. There are a bunch of decisions here that we haven't addressed yet. How do we know how many features to put in each convolutional layer? How do we know how big they should be, how many pixels per side? How do we choose the size and stride of our pooling windows? In our fully connected layers, how many layers do we have, and how many hidden neurons do we put in each? These decisions are called hyperparameters. They're also values we can choose, but they're the next level up: they control how everything else happens. To see how well a set of hyperparameters performs, we have to train the whole thing on all the images from start to finish, but the same principles apply: we can adjust them and choose them to get the best possible result. It's worth noting that there simply isn't enough computation available in the world to try every possible combination, so what we have now are some recipes, some things that researchers have stumbled upon that seem to work well and get reused. But there are lots of combinations of these hyperparameters that haven't really been tried yet, so there's always a chance that some combination works even much better than anything we've seen so far. Also, we don't have to use convolutional neural networks just for images. Any two-dimensional or three-dimensional data works; what matters is that in the data, things that are closer together are more closely related than things that are far apart: it matters whether two things are in adjacent rows or columns. In images this is clearly the case; the location of a pixel in the array of pixels is part of the information. If you were to randomly shuffle the rows and columns, you would lose the information that is there.
That spatial structure is what makes data suitable for convolutional neural networks, and anything that can be made to look like an image can also be suitable. For example, if you are working with audio, you have a really natural x-axis: your columns can be successive time steps.
You don't want to shuffle those, because the order in which things happen in time is important, and you can make your rows the intensity in different frequency bands, going from low frequency to high frequency; again, the order matters there. So you can take sound, apply this processing to it, make it look like an image, and find patterns in the sound that you couldn't conveniently find otherwise. You can also do this with text, with a little work: you can make each of your rows a different word in the dictionary, and make your columns the position in the sentence, the location where the word occurs in time.
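As a rough sketch of this audio-to-image idea, here is a simple magnitude spectrogram built with plain Python (a naive DFT; the frame and step sizes are arbitrary choices, and real code would use an FFT library):

```python
import math

def spectrogram(signal, frame=64, step=32):
    """Rows = frequency bands (low to high), columns = time steps."""
    columns = []
    for start in range(0, len(signal) - frame + 1, step):
        chunk = signal[start:start + frame]
        column = []
        for k in range(frame // 2):  # one magnitude per frequency band
            re = sum(s * math.cos(2 * math.pi * k * n / frame)
                     for n, s in enumerate(chunk))
            im = sum(-s * math.sin(2 * math.pi * k * n / frame)
                     for n, s in enumerate(chunk))
            column.append(math.hypot(re, im))
        columns.append(column)
    # Transpose so that rows are frequency bands and columns are time steps.
    return [list(row) for row in zip(*columns)]

# Toy signal: a pure tone with 8 cycles per 64-sample frame,
# which should light up a single frequency row.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
image = spectrogram(tone)
```

Fed to a convolutional network, `image` plays the same role a grid of pixels would.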
Now there are some limitations here. Convolutional neural networks only capture local spatial patterns, so if your data can't be made to look like an image, or if doing that doesn't make sense, they're less useful. For example, imagine you have customer data with columns representing things like names, ages, addresses, emails, purchases, transactions, and browsing histories, with one customer per row. If you were to rearrange the rows or rearrange the columns, the information itself wouldn't really be compromised: it would all still be there, still queryable, searchable, and interpretable. Convolutional neural networks don't help you here. They look for spatial patterns, so if the spatial organization of your data isn't meaningful, they won't be able to find what matters. As a general rule: if your data is just as useful after swapping your columns with each other, then you shouldn't use convolutional neural networks. A big takeaway from all this is that convolutional neural networks are really good at finding patterns and using them to classify images; they're the best tool we have for that right now. The takeaway is not that you should code your own convolutional neural network from scratch. You can, and it's a great exercise and a lot of fun, but in practice there are lots of mature tools that are useful and waiting to be applied. The takeaway is that you will be asked to make many subtle decisions: how to prepare your data and feed it in, how to interpret the results, and how to choose these hyperparameters.
For this, it helps to know what is being done with your data and what it all means, so that you can make the most of these tools. Good luck.
