
MIT Deep Learning Basics: Introduction and Overview

May 02, 2020
Welcome everyone to 2019. It's really good to see everyone here surviving the cold. This is course 6.S094, Deep Learning for Autonomous Vehicles. It is part of a series of courses on deep learning that we teach throughout this month. The website where you can get all the video content, the lectures, and the code is deeplearning.mit.edu. The videos and slides will be available there, along with a GitHub repository accompanying the course. Assignments for registered students will be emailed later in the week. And you can always contact us with questions, concerns, or comments at hcai@mit.edu, that's human-centered AI at mit.edu.
So let's start with the basics, the fundamentals. To summarize in one slide: what is deep learning? It is a way to extract useful patterns from data in an automated way, with as little human effort involved as possible, hence the "automated." How? The fundamental aspect that we will talk a lot about is the optimization of neural networks. The practical side, for which we will provide the code and so on, is that there are libraries that make it accessible and easy to do some of the most powerful things in deep learning, using Python, TensorFlow, and friends. The hard part, always, with machine learning and artificial intelligence in general, is asking good questions and getting good data.

Many times, the interesting aspects that the news covers, and many of the interesting aspects of what gets published at the prestigious conferences, on arXiv, or in a blog post, is the methodology. The hard part is applying that methodology to solve real-world problems, to solve fascinating, interesting problems. And that requires data, it requires asking the right questions of that data, organizing it and labeling it, selecting the aspects of that data that can reveal the answers to the questions you ask. So why has there been this great advance in the last decade in the application of neural networks, of the ideas contained in neural networks?
What has happened? What has changed? They've been around since the 1940s, and the ideas were percolating even earlier. The digitization of information, of data: the ability to easily access data in a distributed manner across the globe. All kinds of problems are now in digital form and can be accessed by learning algorithms. Hardware: compute, both CPUs with Moore's law and GPUs and ASICs, Google's TPU systems, hardware that allows effective, efficient, large-scale execution of these algorithms. Community: people here and around the world being able to work together, to talk to each other, and to fuel the fire of excitement behind machine learning, on GitHub and beyond.
The tools: we will talk about TensorFlow, PyTorch, and everything else that allows a person with an idea to reach a solution in less and less time. Higher and higher levels of abstraction allow people to solve problems in less and less time, with less and less knowledge, where the idea and the data become the central point, not the effort that takes you from an idea to the solution. And there have been many exciting developments, some of which we will talk about: from face recognition to the general problem of scene understanding, image classification, speech, text, natural language processing, transcription, translation, to applications in medicine and medical diagnosis.
And in cars, being able to solve many aspects of perception in autonomous vehicles, with drivable-area and lane detection and object detection; digital assistants on your phone and beyond, in your home; ads and recommender systems, from Netflix to search to social networks like Facebook. And, of course, the successes of deep reinforcement learning in game playing, from board games to StarCraft and Dota. Let's take a step back. Deep learning is more than a set of tools for solving practical problems. Pamela McCorduck said in '79 that "AI began with the ancient wish to forge the gods." Throughout our history, throughout our civilization, human civilization, we have dreamed of creating echoes of whatever is in this mind of ours in the machine.
And the creation of living organisms, in 19th-century popular culture, from Frankenstein to Ex Machina: this vision, this dream of understanding intelligence and creating intelligence, has captivated us all. And deep learning is at the core of that, because there are aspects of learning that capture our imagination about what is possible: given the data and the methodology, what can be learned, learning to learn, and beyond, how far that can take us. And here, only 3% of the neurons and one millionth of the synapses of our own brain are visualized. This incredible structure is in our minds, and there are only echoes of it, small shadows of it, in the artificial neural networks that we are able to create. Nevertheless, those echoes are inspiring to us. The history of neural networks on this pale blue dot of ours started quite a long time ago, with summers and winters, with excitements and periods of pessimism: starting in the '40s with neural networks and the implementation of those neural networks as a perceptron in the '50s; with the ideas of backpropagation, restricted Boltzmann machines, and recurrent neural networks in the '70s and '80s; with convolutional neural networks and the MNIST dataset, as datasets began to percolate, and LSTMs and bidirectional RNNs in the '90s; and the rebranding and rebirth of neural networks under the banner of deep learning and deep belief nets in 2006;
the birth of ImageNet in 2009, the dataset on which the possibilities of what deep learning can bring to the world were first illustrated in recent years; and AlexNet, the network that on ImageNet did exactly that, with ideas like dropout that improved neural networks over time, year after year improving their performance. In 2014, the idea of GANs, which Yann LeCun called the most exciting idea of the last 20 years: generative adversarial networks, the ability to generate data with very little supervision, to generate new samples after forming a representation, an understanding of the high-level abstractions in what is extracted from the data. Creating: the idea of being able to create as opposed to memorize is really exciting. And on the applied side, in 2014, with DeepFace, the ability to do face recognition. There have been many advances on the computer vision front, that being one of them. The world was inspired and captivated in 2016 with AlphaGo, and in '17 with AlphaZero, beating with less and less effort the best players in the world at Go, the problem that for much of the history of artificial intelligence was thought to be unsolvable. And new ideas with capsule networks. And this year, the year 2018, was the year of natural language processing: a lot of interesting breakthroughs, from BERT at Google and others, that we will talk about, advances in the ability to understand language, to understand speech, and everything, including generation, that is built around that.
And there is a parallel history of tooling, starting in the '60s with the perceptron and its wiring diagrams, ending this year with PyTorch 1.0 and TensorFlow 2.0, these really solidified, exciting, powerful ecosystems of tools that allow you to do a lot with very little effort. The sky is the limit, thanks to the tools. So let's go from the big picture to the smaller one. Everything should be made as simple as possible. So let's start simple, with a little piece of code, before we jump into the details and go over everything that is possible in deep learning.
At a very basic level, with just a few lines of code, really six here, six little pieces of code, you can train a neural network to understand what is going on in an image. The classic, which I will always love, the MNIST dataset of handwritten digits, where the input to a neural network, a machine learning system, is a picture of a handwritten digit and the output is the number that is in that picture. It is as simple as: first step, import the TensorFlow library. Second step, import the MNIST dataset. Third step, like Lego bricks, stack the neural network layer by layer, with an input layer, a hidden layer, and an output layer.
Step 4: train the model, as simple as a single line, model fit. Step 5: evaluate the model on the test dataset. And that's it. In step 6, you are ready to deploy, ready to predict what is in the image. It's that easy, and a rough sketch of these six steps is shown below. A lot of this code, obviously much more complicated, or rather much more elaborate, rich, interesting, and complex, we will make available on GitHub, in the repository that accompanies these courses. Today we will release the first tutorial, on driving-scene segmentation, and I encourage everyone to go through it. And then, the tools side on one slide, before we dive into the neural networks and deep learning.
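As a side note, a minimal sketch of those six steps in tf.keras might look something like the following; the particular layer sizes, optimizer, and number of epochs here are illustrative choices, not necessarily the ones used in the course repository.

```python
# A minimal sketch of the six steps described above, using tf.keras.
import tensorflow as tf

# Steps 1-2: import TensorFlow and load the MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # normalize pixels to [0, 1]

# Step 3: stack the network layer by layer, like Lego bricks.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # input layer
    tf.keras.layers.Dense(128, activation='relu'),    # hidden layer
    tf.keras.layers.Dense(10, activation='softmax'),  # output layer: 10 digits
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Step 4: train the model.
model.fit(x_train, y_train, epochs=5)

# Step 5: evaluate on the held-out test set.
model.evaluate(x_test, y_test)

# Step 6: deploy -- predict what is in a new image.
predictions = model.predict(x_test[:1])
```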
On the tools side, among many other things, TensorFlow is a deep learning library, an open-source library from Google, the most popular one today, the most active, with a big ecosystem. It is not just a library that you import in Python and solve some basic problems with; there is an entire ecosystem of tools. There are different levels of APIs. Much of what we will do in this course will use the higher-level API with Keras. But there is also the ability to run TensorFlow.js in the browser and TensorFlow Lite on the phone. And in the cloud, without needing any hardware or any libraries set up on your own machine, you can run all the code we provide with Google Colab, Colaboratory.
There is Google's ASIC hardware optimized for TensorFlow, the TPU, the Tensor Processing Unit; the ability to visualize models with TensorBoard; models provided through TensorFlow Hub. And, most importantly I think, this is a complete ecosystem that includes documentation and blog posts that make it extremely accessible to understand the fundamentals of the tools that allow you to solve problems, from natural language processing to computer vision to GANs, generative adversarial networks, to deep reinforcement learning, and so on. That is why we are excited to work both on the theory in this course, in this lecture series, and on the tools, the applied side, with TensorFlow.
It really makes these ideas exceptionally accessible. At the core of deep learning, then, is the ability to form higher and higher levels of abstraction, of representations in raw data and patterns, higher and higher levels of understanding of patterns. And those representations are extremely important and effective for being able to interpret data. Under certain representations, data is trivial to understand: cat versus dog, blue dot versus green triangle. Under others it is much more difficult. In this task, drawing a line under polar coordinates is trivial; under Cartesian coordinates it is very difficult, well, impossible to do accurately. And that is a trivial example of representation.
So our task with deep learning, with machine learning in general, is to form representations that map the topology, whatever the topology, whatever the rich problem space you are trying to deal with from the raw inputs, in such a way that the final representation is trivial to work with: trivial to classify, trivial to perform regression on, trivial to generate new samples of that data. And that, forming ever higher levels of representation, is really the dream of artificial intelligence. That is understanding, making the complex simple, as Einstein said a few slides ago. And that, as Juergen Schmidhuber or whoever else said it, I don't know, has been the dream of all of science in general.
The history of science is the history of the progress of compression, of forming simpler and simpler representations of ideas. A model of the universe, of our solar system, with the Earth at the center is much more complex to do physics with than a model in which the Sun is at the center. These higher and higher levels of simple representation enable us to do extremely powerful things. That has been the dream of science and the dream of artificial intelligence. And why deep learning? What is so special about deep learning in the bigger world of machine learning and artificial intelligence?
It is the ability to more and more remove the input of human experts, to remove the human from the picture, the costly, inefficient effort of human beings in the picture. Deep learning automates much of the extraction from raw data, gets us closer and closer to the raw data, without the need for human involvement, of human experts: the ability to form representations from the raw data, as opposed to a human having to extract features, as was done in the '80s, the '90s, and the early 2000s, features that machine learning algorithms can then work with. Automated feature extraction allows us to work with larger and larger datasets, removing the human completely, except for the supervision, the labeling step at the end.
It does not require the human expert. But at the same time, there are limits to our technologies. There is always a trade-off between excitement and disillusionment. The Gartner hype cycle, as much as we don't like to think about it, applies to almost every technology. Of course, the magnitude of the peaks and valleys is different. But I would say we are at the peak of inflated expectations with deep learning. And that is something we have to think about as we talk through some of the exciting ideas and possibilities of the future. And with the autonomous vehicles that we will talk about in future lectures of this course, we are in the same situation.
In fact, we are a little bit beyond the peak. And so it's up to us, this is MIT, the engineers, and the people working on this in the world, to carry us through the trough, to carry us into the future, as the ups and downs of the excitement settle into the plateau of productivity. Why else not deep learning? If we look at real-world applications, especially with humanoid robotics, robotic manipulation, and even autonomous vehicles, the majority of aspects of autonomous vehicles do not heavily involve machine learning today. The problems are not formulated as data-driven learning; instead, they are model-based optimization methods that do not learn from data over time.
And as we will see from the speakers over these two weeks, machine learning is starting to appear here and there. But in the examples shown here, with the incredible humanoid robotics at Boston Dynamics, almost no machine learning has been used to date, except for trivial perception. The same with autonomous vehicles: machine learning and deep learning have hardly been used, except for perception. Some aspect of enhanced perception from the visual texture information. Beyond that, what is starting to be used a little more is recurrent neural networks to predict the future, to predict the intent of the different players in the scene in order to anticipate what the future will be.
But these are very early steps. Most of the success to date, the 10 million miles that have been achieved, has been attributed primarily to non-machine-learning methods. Why else not deep learning? Here is a really clear example of the unintended consequences, of the ethical issues we have to really think about. When an algorithm learns from data based on an objective function, a loss function, the consequences of an algorithm optimizing that function are not always obvious. Here is an example of a human player playing the game Coast Runners, a boat-racing game, where the task is to go around the racetrack and try to win the race.
And the goal is to get as many points as possible. There are three ways to get points: the finishing time, how long it took you to finish; the finishing position, where you ended up in the ranking; and picking up the turbos, those little green things along the track, which give you points. Okay, pretty simple. So we design an agent, in this case an RL agent, that optimizes for the reward. And what we find, here on the right, is that the agent discovers that the optimal behavior actually has nothing to do with finishing the race or the ranking. You can get many more points by just focusing on the turbos, collecting those little green dots, because they regenerate.
So you go around in circles over and over again, slamming into the wall and collecting the green turbos. And that is a very clear example of an objective function that was formulated and appeared well reasoned but had totally unexpected consequences, at least without considering those consequences ahead of time. And it shows the need for AI safety, for a human in the loop of machine learning. That is why it is not just about deep learning alone. The challenge of deep learning algorithms, of applied deep learning, is to ask the right question and to understand what the answers mean. You have to take a step back and look at the difference, the distinction, the levels, the degrees of what an algorithm is accomplishing.
For example, image classification is not necessarily scene understanding; in fact, it is very far from scene understanding. Classification may be very far from understanding. And the data can vary drastically across the different benchmark datasets used: professionally taken photographs versus synthetically generated images versus real-world data. And real-world data is where the big impact is. Many times what works on one does not transfer to the other. That is the challenge of deep learning: solving all of these problems of variation, different lighting conditions, pose variation, inter-class variation, all the things that we take for granted as human beings with our incredible perception system.
All of it has to be solved in order to understand a scene better and better, along with all the other things we have to close the gap on that we are not even close to yet. Here is an image from Andrej Karpathy's blog from a few years ago of former President Obama stepping on a scale. We can classify, we can do semantic segmentation of the scene, we can do object detection, we can do a little bit of 3D reconstruction from a video version of the scene. But what we cannot do well are all the things we take for granted. We cannot tell the images in mirrors apart from reality.
We cannot deal with the sparsity of information: with just a few pixels of President Obama's face, we humans can still identify the President. Or the 3D structure of the scene, that there is a foot on top of a scale and human beings behind it, from a single image. The things we can trivially do, using all the common-sense semantic knowledge that we have, machines cannot: the physics of the scene, that there is gravity. And the biggest thing, the hardest thing, is what people are thinking about, and what people are thinking about what other people are thinking, and so on.
Mental models of the world: being able to infer what people are thinking about. There has been a lot of interesting work, here at MIT as well, on what people are looking at, but we are not even close to solving that problem. And what people are thinking about, we have not even begun to really work on that problem. Yet we do it trivially as human beings. And at the core of that, I think, lies the problem of visual perception, because it is something that we really take for granted as human beings, especially when we are trying to solve real-world problems, especially when we are trying to solve autonomous driving.
We have 540 million years of data for visual perception, so we take it for granted. We don't realize how difficult it is. And we instead focus all our attention on this recent development of a hundred thousand years of abstract thought, being able to play chess, being able to reason. But visual perception is extremely difficult, at every layer of what is required to perceive, interpret, and understand the fundamentals of a scene. As a trivial way to show this, here are all the ways you can mess with these image classification systems by adding a little bit of noise. In recent years many papers have been published, a lot of work has been done, showing that these systems can be fooled by adding noise.
Here, with 99% accuracy the system predicts dog; add a little bit of distortion, and immediately it predicts with 99% accuracy that it is an ostrich. And you can do that kind of manipulation with a single pixel. This is simply a clear way of showing the gap between image classification on an artificial dataset like ImageNet and real-world perception that has to be solved, especially for life-critical situations like autonomous driving. I really like Max Tegmark's visualization of this rising sea in Hans Moravec's landscape of human competence. And this is the distinction, as we go forward with the machine learning methods we've discussed: there is human intelligence, general human intelligence.
Let's call it Einstein here, which is able to generalize over all kinds of problems, from the common-sense ones to the incredibly complex ones. And then there is the way we have been doing machine learning, especially data-driven machine learning, which is savant, which is specialized intelligence: extremely smart at a particular task but unable to transfer except to a very narrow neighborhood of this landscape, where the different arts, cinematography, and book writing sit at the peaks, and chess, arithmetic, theorem proving, and vision sit down in the lowlands. And there is this rising sea as we solve one problem after another. The open question is whether the deep learning methodology and approach, of everything we are doing now, can keep the sea rising, or whether fundamental breakthroughs are required in order to generalize and solve these problems.
And so, on the specialized side, where the successes are, the systems essentially boil down to a given dataset and the ground truth for that dataset. Here, the cost of an apartment in the Boston area: being able to input a few parameters and, based on those parameters, predict the cost of the apartment. That is the basic premise behind the successful supervised deep learning systems of today. If you have good enough data, good enough ground truth, and it can be formalized, we can solve it. Some of the recent promise, which we will do an entire lecture on in the third week, on deep reinforcement learning, shows that from raw sensory information, with very little annotation, through self-play, where systems learn without human supervision, they are able to perform extremely well in these constrained contexts.
In the context of a video game, here, pong to pixels: being able to take the raw pixels of this game of Pong as raw input and learn the fundamental, quote-unquote, physics of this game, understand how it behaves and how to win it. This is a kind of step toward general-purpose artificial intelligence, but it is a very small step, because it is a very trivial, simulated situation. That is the challenge in front of us: with less and less human supervision, being able to solve huge real-world problems. From supervised learning at the top, where most of the teaching is done by humans through the annotation process, labeling all the data, showing different examples, more and more toward semi-supervised learning, reinforcement learning, and unsupervised learning, which remove the teacher from the picture.
And make that teacher extremely efficient when it is needed. Of course, data augmentation is one method we will talk about: taking a small number of examples and messing with that set of examples, augmenting it, through trivial and more complex methods of cropping, stretching, shifting, and so on, including generative networks modifying those images, to grow a small dataset into a large one, in order to minimize, to decrease more and more, the input that is needed from the human, from the human teacher. But that is still quite far away from the incredibly efficient teaching and learning that humans do. This is a video, and there are many of these online, of the first time a human baby walks.
We learn to do this, you know, it's kind of one-shot learning. One day you're on all fours, and the next day you put your two hands up and then you figure out the rest. One shot. Well, you can quibble with that, but the point is that we are extremely efficient: with just a few examples we are able to learn the fundamental aspects of how to solve a particular problem. Machines, in most cases, need thousands, millions, and sometimes more examples, depending on how life-critical the application is. The data flow of supervised learning systems is: input data, a learning system, and the output.
Now, in the training stage, for the output we have the ground truth, and we use that ground truth to teach the system. In the testing stage, when it goes out into the wild, there is new input data for which we have to generalize with the learning system, we have to make our best guess. In the training stage, the process with neural networks is: given the input data for which we have the ground truth, pass it through the model and get a prediction. And since we have the ground truth, we can compare the prediction to the ground truth and look at the error.
And based on that error, adjust the weights. The kinds of predictions we can make are regression and classification: regression is continuous and classification is categorical. Here, the regression problem is saying what the temperature will be tomorrow, and the classification formulation of that problem is saying whether it will be hot or cold, given some threshold definition of hot or cold. That is regression and classification. And on the classification front, it can be multi-class, which is the standard formulation, where we are tasked with saying that a particular entity can be only one thing; and then there is multi-label, where a particular entity can be multiple things.
And in general, the input to the system does not have to be just a single sample of the particular dataset, and the output does not have to be a single sample either. They can be a sequence: sequence to sequence, a single sample to a sequence, a sequence to a sample, and so on. From video captioning to translation to natural language generation and, of course, the one-to-one mapping of general computer vision. Okay, that is the bigger picture. Let's go from the big to the small, to a single neuron, inspired by our own brain, by the biological neural networks in our brain, the computational building block behind much of the intelligence in our mind.
The artificial neuron has inputs with weights, plus a bias, an activation function, and an output. It is inspired by the biological neuron, as I showed before. Here, the thalamocortical system is visualized, with three million neurons and 476 million synapses. The full brain has a hundred billion neurons and a thousand trillion synapses. ResNet and some of the other state-of-the-art networks have tens, hundreds of millions of edges, of synapses. The human brain has ten million times more synapses than artificial neural networks, and there are other differences. The topology of the brain is asynchronous and not constructed in layers. The learning algorithm for artificial neural networks is backpropagation; for our biological networks, we don't know what it is.
That is one of the mysteries of the human brain; there are ideas, but we really don't know. Power consumption: the human brain is much more efficient than neural networks; that is one of the problems we are trying to solve, and ASICs are starting to address some of it. And the stages of learning: in biological neural networks, you never really stop learning, you are always learning, always changing, both in the hardware and in the software. In artificial neural networks, most of the time there is a distinct training stage and a distinct testing stage, when you release the thing into the wild.
Online learning is an exceptionally difficult thing that we are still in the early stages of. This neuron, the fundamental computational block behind neural networks, takes a few inputs, applies weights to them, which are the parameters that are learned, sums them up, adds the bias, also a learned parameter, puts the result through a nonlinear activation function, and gives an output. And the task of this neuron is to get excited based on certain aspects, certain features, of the inputs that came before it. And in that ability to discriminate, to get excited by certain things and not get excited by others, it holds a little piece of information, at whatever level of abstraction that is.
So when you combine many of them together, you have knowledge: different levels of abstraction forming a knowledge base that is able to represent, to understand, or even to act on a particular set of raw inputs. And you stack these neurons together in layers, both in width and in depth, increasing further and further. There are a lot of different architectural variants, but they begin with this basic fact: a neural network with a single hidden layer can approximate any arbitrary function. That means that any other neural network with multiple layers and so on is just interesting optimization of how we can discover those functions.
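As a side note, the computation that a single neuron performs, as described above, is small enough to write out directly; here is a minimal sketch with made-up input values and weights, using a sigmoid as the nonlinear activation.

```python
# A minimal sketch of the single artificial neuron described above:
# a weighted sum of the inputs, plus a bias, passed through a nonlinear activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative values)
w = np.array([0.8,  0.1, -0.4])  # learned weights (illustrative values)
b = 0.2                          # learned bias

output = sigmoid(np.dot(w, x) + b)  # how strongly the neuron "fires"
```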
The possibilities are endless. And the other aspect here is that the mathematical underpinnings of neural networks, with the weights and the differentiable activation functions, are such that the few steps from inputs to outputs are deeply parallelizable. And that is the other aspect of the computation: the parallelizability of neural networks is what enables some of the exciting advances on graphics processing units, GPUs, and on ASICs, TPUs, the ability to train and perform inference on neural networks across machines and across GPU units at a very large, distributed scale. Activation functions: these activation functions, together with the rest of the network, have the task of optimizing a loss function.
For regression, that loss function is usually mean squared error, and there are many variants. For classification, it is the cross-entropy loss. For cross-entropy loss, the ground truth is 0s and 1s; for mean squared error, it is a real number. And so, with the weights, the bias, and the activation functions, the input is propagated forward through the network from the input to the output, and with the loss function we use the backpropagation algorithm, which I spent an entire lecture on last time, to adjust the weights: the error flows backward through the network and adjusts the weights, so that the weights that were responsible for producing the correct output are increased and the weights that were responsible for producing the incorrect output are decreased.
The forward pass gives you the error. The backward pass computes the gradients, and based on the gradients, the optimization algorithm, combined with a learning rate, adjusts the weights. The learning rate is how fast the network learns. And all of this is possible on the numerical computation side thanks to automatic differentiation. The optimization problem, given the gradients computed by backpropagation flowing backward through the network, is solved with stochastic gradient descent. There are many variants of these optimization algorithms that address various problems, from dying ReLUs to vanishing gradients; there are many different parameters, momentum, and so on. Really, it all comes down to the different problems that have to be solved in this nonlinear optimization.
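As a rough illustration of the forward pass, loss, backward pass, and weight update just described, here is a minimal single-training-step sketch in TensorFlow; the tiny model, the loss, and the learning rate are illustrative placeholders, and x_batch, y_batch are assumed to be a mini-batch of flattened feature vectors and integer labels provided elsewhere.

```python
# A minimal sketch of one training step: forward pass to get the prediction and
# the loss, backward pass to get the gradients, then a stochastic-gradient-descent
# update of the weights scaled by the learning rate.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(x_batch)           # forward pass
        loss = loss_fn(y_batch, predictions)   # compare prediction to ground truth
    gradients = tape.gradient(loss, model.trainable_variables)           # backward pass
    optimizer.apply_gradients(zip(gradients, model.trainable_variables)) # adjust weights
    return loss
```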
Mini-batch size: what is the right size of a batch, or really a mini-batch when it is not the entire dataset, over which to compute the gradients and adjust the learning? Do you do it over a very large batch, or do you do stochastic gradient descent over every single sample? If you listen to Yann LeCun and a lot of the recent literature, small mini-batch sizes are good.
He says: "Training with large mini-batches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use mini-batches larger than 32." A larger batch size means more computational speed, because you don't have to update the weights as often. But a smaller batch size empirically produces better generalization. The problem we are often trying to solve, on the broader scale of learning, is overfitting, and the way we solve it is regularization. We want to train on a dataset without memorizing it to the point where it only does well on that training set; we want it to be generalizable to future things, to things it hasn't seen yet.
Obviously, this is a problem for small datasets, and it also depends on the set of parameters you choose. Here is an example of a sine curve being fit to particular data, versus a ninth-degree polynomial being fit to that same set of data, the blue points. The ninth-degree polynomial is overfitting: it does very well for that particular set of samples, but it does not generalize well to the general case. And the trade-off here is that, as you train further and further, at a certain point there is a divergence: the error keeps decreasing toward zero on the training set while it starts going back up on the test set.
And that is the balance we have to strike. That is done with the validation set. You take a piece of the training set, for which you have the ground truth, set it aside, call it the validation set, and evaluate the performance of your system on that validation set. And once you notice that the network you are training is performing worse and worse on that validation set for a prolonged period of time, that is when you stop. That is early stopping. Basically, it keeps getting better and better, and then there is a period of time, there is always noise of course, and after that period of time it is definitely getting worse.
We have to stop there. This provides an automated way to discover when you need to stop. And there are many other regularization methodologies. Of course, as I mentioned, dropout is a really interesting approach: simply, with a certain kind of probability, randomly removing nodes in the network, along with their incoming and outgoing edges, randomly, throughout the training process. And there is normalization. Normalization is obviously always applied at the input: whenever you have a dataset with different lighting conditions, different variations, different sources, and so on, you want to put it all on the same level playing field.
That way we are learning the fundamental aspects of the input data rather than some less relevant semantic information like lighting variation and so on, so you almost always want to normalize. For example, in computer vision with pixels from 0 to 255, you always normalize to 0 to 1, or -1 to 1, or normalize by the mean and the standard deviation. That is something you should almost always do. What enabled a lot of breakthrough performance in recent years is batch normalization: performing the same kind of normalization later on in the network, looking at the inputs to the hidden layers, and normalizing, based on the batch of data you are training on, by its mean and standard deviation.
Batch normalization, together with batch renormalization, addresses one of its challenges: normalizing during training over mini-batches of the training set does not map directly to the inference stage, the testing stage. And so, by keeping a running average across both training and testing, you can asymptotically approach a global normalization. So there is this idea of normalizing not just the inputs but throughout the network, at all the levels of abstraction that you are forming, and batch renormalization fixes many of the problems that arise at inference time. And there are many other ideas, from layer normalization to weight normalization, instance normalization, and group normalization.
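As a minimal sketch of how several of these regularization ideas fit together in code, here is an illustrative tf.keras model with batch normalization and dropout inside the network, plus early stopping on a validation split; the sizes, rates, and patience are arbitrary choices, and x_train, y_train are assumed to be an already normalized dataset available elsewhere.

```python
# Dropout and batch normalization inside the network, early stopping on a
# validation split held out from the training set.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),  # normalize activations over the mini-batch
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dropout(0.5),          # randomly drop nodes during training
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Stop once the validation loss has not improved for a few epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

# x_train, y_train assumed to exist and be normalized; 10% is held out for validation:
# model.fit(x_train, y_train, validation_split=0.1, epochs=100, batch_size=32,
#           callbacks=[early_stop])
```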
And you can play with a lot of these ideas in the TensorFlow playground, at playground.tensorflow.org; I highly recommend it. So now let's run through a bunch of different ideas, some of which we will cover in future lectures, of what all of this is in this world of deep learning, from computer vision to deep reinforcement learning, from the different small-scale techniques to large-scale natural language processing. So, convolutional neural networks: the ones that enable image classification. These convolutional filters slide over the image and are able to take advantage of the spatial invariance of visual information, that a cat in the top left corner has the same features as a cat in the top right corner, and so on.
Images are just a set of numbers, and our task is to take that image and produce a classification, using the spatial invariance of the visual information by sliding a convolutional filter across the image and learning that filter, as opposed to having to learn the same features separately for every region of the image where they appear. And stacked on top of each other, these convolutional filters can form high-level abstractions of the visual information in images. With AlexNet, as I mentioned, and the ImageNet dataset and challenge captivating the world with what is possible with neural networks, performance has been improving year after year.
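A minimal sketch of such a convolutional network in tf.keras might look like the following; the filter counts and layer arrangement here are illustrative, not any particular published architecture.

```python
# Stacked convolutional filters followed by a classifier, for image classification.
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),  # image classification output
])
```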
Human-level performance was surpassed, with some caveats. GoogLeNet with the inception module, different ideas that came along, ResNet with the residual blocks, and SENet most recently. Then the object detection problem is the next step in visual recognition. Image classification is just taking the entire image and saying what is in it. Object detection with localization is finding all the objects of interest in the scene and classifying them. Region-based methods, like Faster R-CNN shown here, take the image, use a convolutional neural network to extract features from that image, and generate region proposals.
Here are a bunch of candidates that you should look at, and within those candidate regions it classifies what they are and generates the four parameters of the bounding box that captures that thing. So object detection and localization ultimately boils down to a bounding box, a rectangle, with the most likely class that is inside that bounding box. And you can really summarize region-based methods as: generate the region proposals, as in the little pseudocode here, do a for loop over the region proposals, and perform detection inside that for loop. Single-shot methods remove the for loop: there is a single pass through the network. Take SSD, for example, shown here.
It takes a pre-trained neural network that has been trained to do image classification, stacks a bunch of convolutional layers on top, and from each layer extracts features that are able, in a single pass, to generate the bounding-box predictions and the class associated with each bounding box. The trade-off here, and this is where the popular YOLO v1, v2, v3 come in, is often between performance and accuracy: single-shot methods are often lower performing, especially in terms of accuracy on objects that are really far away, or rather, objects that are small in the image, or really large.
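As a rough sketch of the distinction just described, here is some pseudocode in Python; every helper function in it (extract_features, propose_regions, classify_and_regress_box, predict_boxes_single_pass) is a hypothetical placeholder standing in for a whole sub-network, not a real library call.

```python
# Region-based detection: a for loop over region proposals.
def region_based_detection(image):
    features = extract_features(image)         # convolutional feature extraction
    proposals = propose_regions(features)      # candidate regions of interest
    detections = []
    for region in proposals:                   # the for loop over region proposals
        class_label, box = classify_and_regress_box(features, region)
        detections.append((class_label, box))  # a class plus 4 bounding-box parameters
    return detections

# Single-shot detection: no explicit loop, boxes and classes come out in one pass.
def single_shot_detection(image):
    features = extract_features(image)
    return predict_boxes_single_pass(features)
```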
Then the next step in visual perception, in visual understanding, is semantic segmentation. That is what the tutorial that we present here, on GitHub, covers. Semantic segmentation is the task, as opposed to a bounding box, or classifying the entire image, or detecting objects as bounding boxes, of assigning at the pixel level the boundaries of what each object is: of classifying, in the classic full-scene segmentation, which class every single pixel belongs to. And the fundamental aspect, which we will cover a bit more on Wednesday, is taking an image classification network and chopping it off at some point.
Then performing the encoding step of compressing a representation of the scene, and taking that representation with a decoder that upsamples it in a dense way, upsampling the classification back to the pixel level. There are a lot of tricks in that upsampling, which we will talk about; they are interesting, but it ultimately boils down to the encoding step of forming a representation of what is going on in the scene, and then the decoding step of upsampling the pixel-level annotation, classifying every individual pixel.
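A minimal sketch of that encoder-decoder shape in tf.keras could look like the following; the input size, the number of classes, and the layer counts are illustrative assumptions, not the architecture from the course tutorial.

```python
# Encoder: compress the scene into a representation. Decoder: upsample back to a
# per-pixel classification.
import tensorflow as tf

num_classes = 20  # hypothetical number of segmentation classes

segnet = tf.keras.Sequential([
    # Encoder: downsample and form a compressed representation of the scene.
    tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu',
                           input_shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(128, 3, strides=2, padding='same', activation='relu'),
    # Decoder: upsample back to full resolution, one class score per pixel.
    tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=2, padding='same',
                                    activation='softmax'),
])
```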
And as I mentioned, the underlying idea most widely and successfully applied in computer vision is transfer learning. The most commonly applied form of transfer learning is taking a pre-trained network like ResNet and chopping it off at some point: cutting off the fully connected layers, some part of the layers, and then taking a new dataset and retraining that network. So what is this useful for? For every single computer vision application in industry. When you have a specific application, say you want to build a pedestrian detector: if you want to build a pedestrian detector and you have a pedestrian dataset, it is useful to take ResNet trained on ImageNet or COCO, chop off some of the layers trained on the general case of visual perception, and then retrain it on your specialized pedestrian dataset.
And depending on how large the dataset you have is, some of the earlier layers of that pre-trained network should be fixed, frozen, and sometimes not, depending on the size of the data. This is extremely effective, in computer vision but also in audio, speech, and NLP.
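A minimal sketch of that recipe in tf.keras might look like this; the two-class head and the input size are illustrative assumptions (say, pedestrian versus not-pedestrian), and the specialized dataset itself is assumed to exist elsewhere.

```python
# Transfer learning: take a network pre-trained on ImageNet, cut off its
# classification head, freeze the general-purpose layers, and retrain a small
# specialized head on a new dataset.
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                      pooling='avg', input_shape=(224, 224, 3))
base.trainable = False  # freeze the general-purpose visual features

pedestrian_detector = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation='softmax'),  # retrained specialized head
])
pedestrian_detector.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```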
And as I mentioned, pre-trained networks are ultimately forming representations of the data on top of which the classification, the regression, the prediction is made. But the clearest example of this is the autoencoder, which forms representations in an unsupervised way. The input is an image and the output is exactly that same image. So why do we do that? If you add a bottleneck to the network, where the network is narrower in the middle than it is at the inputs and the outputs, it is forced to compress the data down to a meaningful representation. That is what the autoencoder does: you train it to reproduce the output, and to reproduce it with a latent representation that is smaller than the original raw data. That is a really powerful way to compress data. It is used for removing noise and so on, but it is also simply an effective way to demonstrate a concept. It can also be used for embeddings: we have a huge amount of data and we want to form a compressed, efficient representation of that data.
Now, in practice, this is unsupervised. In practice, if you want to form an efficient, useful representation of the data, you usually want to train it in a supervised way, on a discriminative task where you have labeled data, and the network is trained to identify cats versus dogs. A network trained discriminatively, on annotated, supervised data, is able to form better representations. But nevertheless, the concept stands. And one way to visualize these concepts is a tool that I really love, projector.tensorflow.org; it is a way to visualize these different representations, these different embeddings, and you should definitely play with it, and you can insert your own data.
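A minimal sketch of the autoencoder described earlier, with the bottleneck in the middle, might look like this in tf.keras; the layer sizes are arbitrary, and x_train is assumed to be a set of 28x28 images scaled to [0, 1].

```python
# Autoencoder: the input and the target output are the same image, and a narrow
# bottleneck forces the network to learn a compressed latent representation.
import tensorflow as tf

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),   # the bottleneck / latent code
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(28 * 28, activation='sigmoid'),
    tf.keras.layers.Reshape((28, 28)),
])
autoencoder.compile(optimizer='adam', loss='mse')
# Training uses the same images as input and target:
# autoencoder.fit(x_train, x_train, epochs=10)
```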
Okay, going further and further in this direction of forming representations without supervision are generative adversarial networks: from these representations, being able to generate new data. The fundamental methodology of GANs is to have two networks, one the generator, the other the discriminator, and they compete against each other in order to make the generator better and better at generating realistic images. The generator's task is, from noise, based on a certain representation, to generate images that are realistic. And the discriminator is the critic that has to discriminate between real images and the ones generated by the generator.
And both get better together: the generator gets better and better at generating realistic images to fool the discriminator, and the discriminator gets better and better at telling the real from the fake, until the generator is able to generate some incredible things. As shown by the work at NVIDIA, the ability to generate realistic faces has skyrocketed in the last three years. These are samples, trained on celebrity photographs, of faces that they have been able to generate; all of them are generated by a GAN. There is the ability to generate videos that are temporally consistent over time with GANs. And then there is the capability, shown at the bottom right, also from NVIDIA, and I'm sure we will talk about it as well, to go from the pixel level of semantic segmentation to generated imagery.
So from the pixel-level semantic segmentation on the right, being able to generate the entire scene on the left, all the raw, rich, HD pixels on the left. The world of natural language processing works the same way: forming representations, forming embeddings, with Word2Vec, the ability to go from words to representations of those words that can then be used to reason about them efficiently. The idea of forming representations over the data: you have a huge vocabulary, you know, of over a million words, and you want to be able to map those words into a space where the Euclidean distance between words reflects how semantically close or far apart they are.
So things that are similar are close together in that space. And one way to do it, with skip-grams for example, is to take a source text, a large body of text, and turn it into a supervised learning problem: learning to map, to predict, from one particular word to all of its neighbors. So we train the network on the connections between words that are commonly seen in natural language, and based on those connections we can know which words are related to each other. I won't go into too much detail, but the key thing here is that the input vector represents the word and the output vector represents the probability of the other words being connected to it.
The key thing is that both of those are thrown away in the end; what matters is the middle, the hidden layer. That representation gives you the embedding, which represents these words in such a way that in the Euclidean space the ones that are semantically close are together and the ones that are not are apart. And natural language, and other sequence data, text, speech, audio, video, rely on recurrent neural networks. Recurrent neural networks are able to learn temporal dynamics in the data, in sequence data, and are able to generate sequence data. The challenge is that they are not able to learn long-term context.
Because when you unroll a recurrent neural network and train it by unrolling it and doing backpropagation, without any tricks the backpropagated gradient vanishes very quickly, so you cannot memorize context over the longer parts of sentences, unless extensions are used. With LSTMs, long-term dependence is captured by allowing the network to forget information, and to decide what information to let pass through over time: what to forget, what to remember, and what to output at each step. And all of those aspects have gates that are trainable with sigmoid and tanh functions. Bidirectional recurrent neural networks, from the '90s, are an extension that is often used to provide context in both directions.
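A minimal sketch of a recurrent model using LSTM cells, wrapped to be bidirectional as just mentioned, could look like this in tf.keras; the vocabulary size, embedding width, and the binary output (say, sentiment) are illustrative assumptions.

```python
# An embedding layer feeding a bidirectional LSTM, for a toy sequence-classification task.
import tensorflow as tf

rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # word embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),    # context in both directions
    tf.keras.layers.Dense(1, activation='sigmoid'),             # e.g. sentiment yes/no
])
rnn.compile(optimizer='adam', loss='binary_crossentropy')
```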
So recurrent neural networks, as I described them, simply learn representations of what happened in the past. Now, in many cases, when it is not a real-time operation, you can also look into the future, at the data later in the sequence, so it is often beneficial to do a pass through the sequence beyond the current point and then come back. The encoder-decoder architecture in recurrent neural networks is often used when the sequence at the input and the sequence at the output are not necessarily of the same length. The task is to first encode, with the encoder network, everything that comes in, everything in the input sequence.
This is useful for machine translation, for example: encode all the information in the input sequence, in English, and then, given that representation, keep feeding it into the decoder recurrent neural network to generate the translation in the language you are translating into. The input sequence can be much smaller or much larger than the output. That is the encoder-decoder architecture. And then there are improvements. Attention is the improvement on this encoder-decoder architecture that, instead of taking the input sequence, forming one representation of it, and that's it, allows you to look back at different parts of the input.
So you don't rely on just a single vector representation of the entire input. And there has been a lot of excitement around the idea, as I mentioned: part of the dream of artificial intelligence and machine learning in general has been to remove the human from the picture more and more, to be able to automate some of the difficult tasks. So Google's AutoML, and just the general concept of neural architecture search, NASNet: the ability to automate the discovery of the parameters of a neural network, and the ability to discover the actual architecture that produces the best result. With neural architecture search, you have basic building modules, similar to the ResNet modules, and with a recurrent neural network you keep assembling a network together.
And assembling it in such a way as to minimize the loss of the overall classification performance. And it has been shown that you can then construct a neural network that is much more efficient and much more accurate than the state of the art on classification tasks like ImageNet, shown here with a plot, or at the very least competitive with the state of the art and with SENet. It is super exciting that, as opposed to, like I said, stacking the Lego pieces yourself, you can essentially step back and say: here, I have a dataset with the labels, with the ground truth. That is the dream of Google AutoML.
You have the dataset, and you say: please tell me what kind of neural network will do best on this dataset. And that's it: all you bring is the data. It constructs the network through this neural architecture search and returns the model to you, and that's it. That makes it possible to solve, you know, many of the real-world problems that essentially boil down to: I have a few classes I need to be really accurate on, here is my dataset. It turns the problem of a deep learning researcher into the problem of what is more traditionally called a data science engineer, where the task, as I said, is focused on what is the right question and what is the right data to answer that question.
And deep reinforcement learning takes further steps along the path of decreasing human involvement. Deep reinforcement learning is the task of an agent acting in the world based on observations of the state and the rewards received in that state, knowing very little about the world and learning from the very sparse nature of the reward: sometimes only when, in the context of a game, you win or lose, or in the context of robotics, when you successfully accomplish a task or not. From that very sparse reward, it is able to learn how to behave in that world. Here, a cat learning how the bell maps to the food, and much of the incredible work at OpenAI and DeepMind on robotic manipulation and navigation, through self-play in simulated environments.
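As a toy, self-contained sketch of that observation-action-reward loop, here is tabular Q-learning on a tiny hand-made chain world; everything in it (the environment, the states, the reward) is invented for illustration and has nothing to do with any particular environment from the course.

```python
# Tabular Q-learning on a 5-state chain: the agent only gets a reward when it
# reaches the final state, and learns from that sparse reward which actions are good.
import random

n_states, n_actions = 5, 2         # states 0..4; actions: 0 = step left, 1 = step right
q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = 0
    while state != n_states - 1:                   # episode ends at the final state
        if random.random() < epsilon:
            action = random.randrange(n_actions)   # explore
        else:
            action = max(range(n_actions), key=lambda a: q[state][a])  # exploit
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0            # sparse reward
        # Q-learning update: move the estimate toward reward + discounted future value.
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state
```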
And of course, the best from our own DeepTraffic deep reinforcement learning competition, which you can all participate in, and I encourage you to try to win. With no supervised knowledge, no human supervision, through the sparse rewards from the simulation, or through the construct of self-play, these systems are able to learn how to operate successfully in this world. And those are the steps we are taking toward artificial general intelligence. This is what is exciting about the breakthrough ideas that we will talk about on Wednesday, from natural language processing to generative adversarial networks, able to generate arbitrary data, high-resolution data, to create data. Really, from this understanding of the world, to deep reinforcement learning being able to learn how to act in that world with very little input from human supervision, it is advancing further and further, and there have been a lot of interesting ideas, some with different names.
Sometimes misused, sometimes overused, sometimes misunderstood: transfer learning, meta-learning, and the hyperparameter and architecture search, basically removing the human as much as possible from the menial tasks and involving the human only on the fundamental side, as I mentioned with the boat race, on the ethical side. The things that we humans, at least, claim to be pretty good at: understanding the big, fundamental questions, understanding the data that allows us to solve real-world problems, and understanding the ethical balance that needs to be struck to solve those problems well. On the bottom right I show that that is our job here in this room, and the job of all the engineers in the world: to solve these problems and to progress forward through the current summer and through the winter, if it ever comes.
With that, I would like to thank you. You can get the videos, the code, and so on online at deeplearning.mit.edu. Thank you so much, guys.
