
Physics of AI

Mar 26, 2024
I'm happy to be here and introduce you to the work we've been doing in my group for the last two years, which we call the physics of AI. There's a lot I want to tell you, so let's get started. The presentation will be divided into three parts of approximately equal length. In the first part I want to give you a little context for the talk. I think a lot of what I'll say there is familiar; no one can ignore ChatGPT and things like that, but I'd still like to make sure we're all on the same page before we move on to the other two parts. In particular, in the first part I'll try to point out that we are really witnessing the emergence of intelligence in artificial systems, and that this emergence comes from very simple components.
That's what I will talk about in the first part; the second and third parts are about our focus, what we call the physics of AI. In the second part, I'm going to show you a very controlled experiment that we've been running to try to witness emergence firsthand; that part will focus on the emergence of reasoning through the lens of data diversity. In part three, which is more recent work we did over the summer, we'll take a different physics-style approach, which is to come up with toy mathematical models to study emergence, and there we'll focus on gradient descent dynamics. Okay, so let's start with the first part. We're getting inundated with headlines like this these days; every day there's an article in the New York Times about AI. So here is what I believe.
It's been a couple of months now since OpenAI announced DALL·E 2, and the headline says we need to talk about how good AI is getting, and I think we really do need to talk about it; it's really amazing. I know some people are still skeptical, fewer and fewer every day I think, but some are, particularly because we've been hearing about the AI revolution for over a decade, about advances that are supposedly right on the horizon and then never happen, and so on. But from my perspective, this time it's different, this time it's real, and the reason I say that is: look at this image that was generated.
If you can read the caption, it was generated using the words "infinite joy", and for an artificial system to understand such an abstract concept, to get an idea of what infinite joy means, I think it means that it really understands these things and is able to manipulate them. Not only is it able to create a chair that looks like an avocado; it is also able to understand much more abstract concepts, manipulate them, and merge them. So this is what we're going to talk about.
So what has happened? Why are these things suddenly taking off? I claim that intelligence is emerging. So what has really been happening in the last five years? It's scaling. This graph I think many of you have seen: you see exponential growth in the size of the models, the size of the neural networks that we are training, measured by the number of parameters. It starts almost exactly five years ago with an AI2 model that had approximately 100 million parameters, and it grows exponentially; the last one I have on this graph is from January 2020, when Microsoft released the Turing model, which at 17 billion parameters was the biggest at that time. Of course, if I continue this plot into 2021 and 2022, on this exponential scale you wouldn't
see anything, so if we want to continue this plot, we need to go to a logarithmic scale, and on a logarithmic scale it's very clear that it's just going up linearly. You can see basically the same models we were talking about; on this plot there is also GPT-3, which has 175 billion parameters, and Megatron-Turing, which has 530 billion parameters, and these things keep going up. Interestingly, this is about the latest in terms of publicly known models, but you can imagine that all of those companies have not been dormant since the end of 2021, so there is more on the horizon; there was an announcement just yesterday that a new model is coming. Now, I think this image really captures very vividly what's happening. This is from Google's model called Parti, one of those text-to-image models, and the nice thing about Parti is that it's purely a Transformer model; I'll remind you in a minute what Transformers are. Here they did text-to-image, and the prompt is a portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses, standing on the grass in front of the Sydney Opera House, holding a sign on its chest that says Welcome Friends. The interesting thing is that they show the output of four models of different sizes: 350 million parameters, 750 million, 3 billion, and 20 billion, and you can literally see it get better; the kangaroo becomes more and more refined, and suddenly, when you go from 3 billion to 20 billion, for some reason the text becomes correct. Below roughly 10 billion parameters the model somehow doesn't know how to write text, and above 10 billion it does.
So this phenomenon, that some properties seem to appear suddenly just by scaling up, is what we call emergence. Emergent behavior is, I think, the most important thing for the entire AI community right now. This is another image I took from Google; maybe I should have looked for more Microsoft images, sorry. In this image, what they show is how, as they scale a large language model to more and more parameters, it can do more and more things. So at 8 billion parameters,
of course, if it's trained on language modeling, it has some understanding of language and can answer some questions, and interestingly it's also able to do some arithmetic, which is not trivial; in fact, below a billion parameters those models don't know how to do arithmetic. You scale up to 62 billion and suddenly a whole bunch of capabilities get added; for example, now it's able to do translation, which it couldn't do before. You keep scaling up to 540 billion and now suddenly it knows how to explain jokes; before, it didn't get jokes, and at around 500 billion parameters it gets jokes. So this emergence is real. Here is a slightly more scientific plot, from a Google survey again, on emergent abilities of large language models: you see different tasks and different models, GPT-3, LaMDA, Chinchilla, PaLM, all those large language models, and you can see what happens as you scale up. Note, importantly, that the x-axis is not just the number of parameters; it's the total compute spent on training, so it combines both the number of parameters and the training data, because the training data is of course at the essence of this story. And you can see again and again, in all those examples, some kind of phase transition: nothing happens for a long time, and then suddenly the models start being able to perform the task correctly.
The one I like best is arithmetic. You train your model on a lot of web text with a lot of parameters, and if it's below a certain threshold, it doesn't know how to do arithmetic; after that threshold, accuracy increases linearly and very quickly it's able to do it completely well. So, before I pose the question that we want to study in this presentation, which is how intelligence emerges and what physics has to do with that question, I'd like to make sure we're on the same page, and I'll go over what Transformers are, because at the end of the day one of my points is that everything in this story is simple. The components are very simple: there is just one architecture, the Transformer, which I'll explain in a second, and you train it with gradient descent to predict the next word on a large corpus of text. That's all there is to it.
So this is the paper we're going to talk about for the next five minutes, "Attention Is All You Need", published just over five years ago. Let me say something that matters a lot to me: this is going to be the only reference in this talk, and this is terrible behavior on my part, I really don't like it. The problem is that this field is exploding so much that if I had to cite and explain everything related to the work we are doing, it would take me at least another hour, if not two. It would be very interesting,
and I'd love to do it, but I just don't have enough time. So, apologies: this is the only reference in this presentation. Okay, first, what is attention? What is a Transformer? Let's slow down for a second. You all know classical neural networks very well. A classical neuron takes a high-dimensional input, a vector x, and has another high-dimensional vector, call it w; this is the filter. What the neuron computes is just a nonlinearity applied to the linear function w·x, for example the rectified linear unit, ReLU(w·x) = max(w·x, 0), the positive part. That's what the neuron computes. The point is that those w's, those filters, are being learned: you run gradient descent on them to find the best w's to fit the training task you're giving the network. Okay, this is a classical neural network.
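As a minimal sketch of such a neuron (the names and values are illustrative, not from the talk):

```python
import numpy as np

def relu_neuron(x, w):
    """Classical neuron: rectified linear unit applied to the dot product w.x."""
    return max(np.dot(w, x), 0.0)

# Example: a 4-dimensional input and one learned filter w.
x = np.array([1.0, -2.0, 0.5, 3.0])
w = np.array([0.2, 0.1, -0.4, 0.3])
print(relu_neuron(x, w))  # 0.7, the positive part of w.x
```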
I won't say more about this. Now, the Transformer's leap, in my opinion, is that instead of operating on a single input x, such as an image, it operates on a set of inputs, and the key is that there will be an attention module as an additional layer. You layer operations in a neural network, that's what you do, and in the Transformer you alternate classical layers, like the classical neuron I just described, with attention layers. What the attention layer does, in essence, is basically the same as a classical layer, except that it replaces those learned filters with the other elements in the input sequence. The key point is that it becomes a relative machine instead of an absolute machine. What do I mean by that?
Imagine the task is to tell whether an image contains two similar objects. Maybe in my training data I've only ever seen dogs and cats, and at test time I get an image with two birds. A classical network can't tell that there are two similar objects in this image, because it has only learned absolute cat and dog filters. But a Transformer will have no problem doing this, because it is not learning a cat filter and a dog filter; it is learning the concept of "two". How does it do it? It compares the different parts of the image, and if there are two parts that match, it says: I have a match, so the answer is yes, there are two similar parts. This aptitude for comparing elements of the input sequence is, to me, what really enables analogies. You are comparing things; this is the essence of reasoning, and I think that's why Transformers are so successful.
Adding to this point, sets are also very powerful from a modeling perspective. Everything is a set; sets are the basis of mathematics, after all. For example, a sentence is nothing more than a set, not just a set of words but a set of word-position pairs; it is important to remember the position. In reinforcement learning, when you make decisions and have a trajectory, you have a set of triplets: the state, the action, and the reward. If you have graph data, say you're trying to make a protein folding prediction, same story: you can see that as a set too.
The Transformer revolution is that Transformers can now be applied to everything, in every field where we have been doing AI for decades. Transformers are revolutionizing those fields: natural language processing, computer vision, reinforcement learning, and so on. Now I want to have one technical slide on how exactly the attention module works, because when we move on to the experiments I will try to see, mechanically, what is happening inside. So to get into the mechanics of what the Transformer does, I need to explain exactly what it computes.
Remember, I told you that attention compares different parts of the input. So say you have an input sequence x1, x2, up to xn. My attention module will look at one input element x in this sequence, and the way it does the comparison is by producing a probability distribution over the other tokens. This is the attention distribution: for this token, it tells me how much attention it pays to each of the other elements of the input sentence. The way I compute it is literally what I told you before: I take the other input tokens (we call them tokens) as a kind of filter, so I take the dot product of x with each of them, and this gives me a score.
Now I want a probability distribution, so I have to transform those scores, which could be negative, into positive numbers. It is very classical to do this through the exponential function; other choices could be made, but the exponential is good for many reasons. So I pass each score through an exponential; now I have non-negative numbers that don't necessarily sum to one, so let me renormalize by a constant Z. This is the softmax, and now I have the probability distribution that summarizes the comparisons. Again, this alpha_i is a probability distribution computed for the given input token x, so each token will have a different distribution. If you want, you can arrange these as a matrix.
This is the attention pattern matrix: I index my input sentence both in the rows and in the columns, and what I put in each row is just the attention distribution of that token. This is my attention pattern matrix. Now, what the attention module does is take this input sequence x1, x2, up to xn, and simply transform it into another sequence x1', x2', up to xn', where each x is replaced, using the probability distribution I just defined, by a weighted average of all the other tokens, weighted by the attention this token is giving to the others. That's it.
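In symbols, the attention distribution and the re-representation just described look like this (a schematic in the talk's notation, before adding the query/key/value maps):

```latex
\alpha_i(x) = \frac{e^{\langle x, x_i\rangle}}{Z},
\qquad Z = \sum_{j=1}^{n} e^{\langle x, x_j\rangle},
\qquad x \;\mapsto\; x' = \sum_{i=1}^{n} \alpha_i(x)\, x_i .
```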
What do I do now? I add a little more flexibility: instead of doing all the operations in the original space, with the canonical basis, which has no special meaning, I allow myself to rotate the space, both in terms of the filters, in terms of the input, and in terms of the recombination. In Transformer language these are called the query, the key, and the value: I see my x, the one I compare with the others, as a query; the other tokens, when they act as filters, are the keys; and when I recombine them in the formula sum of alpha_i x_i, they are the values. So I add a linear operator on each of those occurrences, and these are the matrices I'm going to optimize.
This is an attention head, and what you do in Transformers is multi-head attention: you replicate this many times and then recombine all the heads linearly. That is an attention layer. All a Transformer does is alternate this attention layer with a plain feed-forward, one-hidden-layer neural network that acts independently on each token. There are also normalizations, which are important for training, and residual connections, but these things don't really matter, at least at the level of this discussion. And that's it: this is the Transformer architecture. Okay, this was all background; now let's move on to the question, which is: how does intelligence emerge?
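Here is a minimal NumPy sketch of one attention head as just described; W_q, W_k, W_v are the learned query/key/value maps (the shapes, and the omission of the usual 1/sqrt(d) scaling, are illustrative simplifications, not claims about the talk's exact setup):

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One attention head: X is (n, d), one row per token."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # rotate into query/key/value spaces
    scores = Q @ K.T                              # pairwise comparisons <q_i, k_j>
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # softmax rows: the attention pattern matrix
    return A @ V                                  # each token becomes a weighted average of values

n, d = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(attention_head(X, W_q, W_k, W_v).shape)    # (5, 8)
```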
I think you've all seen it while playing with ChatGPT: you can just play with it, and yes, it makes a lot of silly mistakes, but clearly there's some intelligence there. How does the intelligence come from gradient descent on next-token prediction? You need to define the loss function, and in this case the loss function is simple: you take a lot of text, you give the model a partial sentence, and you just try to predict the next token.
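A minimal sketch of this next-token cross-entropy objective (the shapes and names here are placeholders, not the talk's setup):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Cross-entropy for next-token prediction.
    logits: (T, V) scores over a vocabulary of size V at each of T positions;
    targets: (T,) index of the true next token at each position."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: 3 positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
print(next_token_loss(rng.normal(size=(3, 5)), np.array([1, 4, 0])))
```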
So you do gradient-descent-based training on next-word prediction. You have a large dataset, very important, very large and very diverse, representing many things, say about a billion tokens, and you have a big Transformer, say about 100 layers and a hundred heads per layer. Remember that those tokens live in a high-dimensional space R^d, but a sentence is just a bunch of discrete words, so you need to embed those discrete tokens in a high-dimensional space.
Let's say you embed them in ten thousand dimensions. That's it: you have those ingredients, you run it for long enough, and boom, intelligence comes out. And here is the truth: I really believe that no one on the planet has any idea what is happening. Just no one. We don't understand; no one understands, and if you do, please let me know. This is not the first time humanity has faced a problem like this; it has happened time and time again, and it was expressed much more elegantly by Arthur Eddington in the context of quantum mechanics, and it captures very well what's going on: something unknown is doing we don't know what. That's really the situation. We don't even know what we mean by intelligence; there is even a problem of definitions. So how can we start?
How can we move forward to understand this great system? I emphasize that its size is important; it is key that you have this large system with many parts that interact with each other, and that there is emergent behavior. And from this phrase I just said, many parts of a large complicated system, and emergent behavior, this is what physics is about: physics is about trying to decompose a system, trying to see what the actual key elements behind the emergent behavior you are witnessing are. Now, I'm not saying we want to use the tools of physics;
what I'm saying is that we want to be inspired by its methodology. Physicists have been grappling with this kind of problem for centuries; let's see how they attacked it. Again, I'm not at all saying this is physics for AI in the sense of reusing the tools that physics has developed over the last two millennia; it is about developing new tools, which I think are necessary, while being inspired by the way physicists approach these things. So how does physics do it?
I think there are two main pillars of physics: controlled experiments and toy mathematical models. When you want to understand what water is made of, why water can turn into ice or vapor, those different phases of water, you don't go stare at a waterfall and say: wow, this is amazing, the water is jumping. That's too complicated; you need to make your experiments much more controlled. How did we discover that there's a nucleus in the atom? It wasn't by looking at a waterfall; we had to do the famous Rutherford gold foil experiment. So the question is: what is going to be our controlled experiment for those big Transformers?
That's what I want to ask, because the waterfall here is GPT, and you're not going to understand anything by staring at the waterfall. The other pillar is toy mathematical models. Of course, in physics we're very lucky because, for some mysterious reason, nature is canonical: you come up with a toy mathematical model for one situation and it applies everywhere. We don't know yet if this will be true in AI, but we should definitely try. So: what is the controlled experiment, and what is the harmonic oscillator of AI? What I'm going to do in parts two and three is tell you about our attempts to answer those questions, and I'm not saying by any means that these are final answers.
In fact, I know they're not final. It's just to show you the kinds of things we, as a community, could try to do more of. We're not the only ones doing it; a lot of other people are, but in my opinion not enough yet. You'll also see, importantly and perhaps sadly, that our findings contradict the findings of the machine learning community over the last 20, 30, 50 years. So something new is really happening and we need to develop new tools. The first thing I will tell you about is a controlled experiment in which we train a Transformer in a very simple environment.
This is our goal for the experiment: we train it to solve a system of linear equations. It is a very simple task, and we will see how data diversity matters even for this very simple task. The second part will be about the non-convex dynamics of training a one-hidden-layer neural network on very special data; I will explain what the sparse coding problem is, and this is a toy model for really understanding the emergence of edge detectors. I'll explain all of this. It is based on two papers that are on arXiv, with a lot of people from my group at Microsoft and also interns and students from academia; I just want to go through the list because they are fantastic people who contributed enormously. The first paper I'll tell you about is Lego; the lead author is Yi Zhang, who was a postdoc with us and now works full time in the group, together with Ronen Eldan, a mathematician who decided to spend more time on AI, Tal Wagner, and others you already know from my group. The second paper, and I will explain all these terms, is on learning threshold units via the edge of stability; this is with fantastic students from MIT and also with Yin Tat Lee, who recently joined my group from UW. The fantastic MIT students are Kwangjun Ahn and Felipe Suarez.
All of them are really outstanding; I'm very lucky to work with this group of people. Okay, so let's get started and do our first physics-of-AI experiment. We're going to train, for real, no joke, an actual neural network, a Transformer, to solve systems of linear equations, and we'll look at the simplest possible kind of system, which looks like this. Even for high school this is too easy, but the idea is pretty much high-school-level math. So you have a system of equations like this.
You just have variables a, b, c, d, in this case, and you define relations between them. Lego stands for Learning Equality and Group Operations, so we only have equality relations, and there may be a group element applied to each variable. Here we'll only talk about the group with two elements: you either leave the variable alone, the identity, or you flip it, multiplying it by minus one. These are all the operations I'm allowing, but the framework allows something more general, and what happens in greater generality is very interesting. In any case, here the sentence says: the variable b is equal to minus the variable a, the variable d is equal to minus the variable c, and I'm giving you values: a equals plus one, and c equals plus b. Of course, we know how to solve this: just follow the chain of equations, the chain of reasoning. You start with a equals plus one, then you look for the next occurrence of a.
The next a is here: you realize that those two occurrences of a are the same thing, so if I know that a is equal to plus one here, I know that a is also equal to plus one there. This is the association step: I have moved my "plus one" concept from this part of the sentence to here, associating those two occurrences. Now I apply the group operation, the minus, so I need to manipulate the concept: I had a plus one, I apply the minus, so it becomes minus one. And then I just need to do a local operation where I assign this minus one to b. So there are three steps of reasoning here.
Association: I'm saying that the two appearances of the variable a are the same. Manipulation: I have a plus one, I apply a minus, it should become a minus one. And a local operation: I collect the information on the right-hand side of the equal sign and send it to this b. So, three operations: association, manipulation, and the local operation. What we are going to do is train a Transformer and try to understand how it implements these three very basic building blocks of reasoning.
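To make the task concrete, here is a minimal sketch of solving one Lego instance by following the chain of reasoning (the clause encoding is my own illustration; the paper's actual token format may differ):

```python
# One Lego instance: each clause is (target_var, sign, source_var);
# sign is +1 (identity) or -1 (the group flip). 'one' is the root token.
clauses = [("b", -1, "a"), ("d", -1, "c"), ("a", +1, "one"), ("c", +1, "b")]

def solve(clauses):
    """Follow the chain of reasoning: association (find the clause whose
    source variable is already resolved), manipulation (apply the group sign),
    local operation (assign the value to the target variable)."""
    values, pending = {"one": 1}, list(clauses)
    while pending:
        for clause in pending:
            target, sign, source = clause
            if source in values:                         # association
                values[target] = sign * values[source]   # manipulation + local op
                pending.remove(clause)
                break
    return values

print(solve(clauses))  # {'one': 1, 'a': 1, 'b': -1, 'c': -1, 'd': 1}
```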
So we're going to train a Transformer, and let me tell you exactly what I put in the input sequence. I embed all the tokens in a high-dimensional space: the equal sign, the minus sign, the comma, and so on; all of them become high-dimensional vectors. I also have positional encoding, to reflect what the position of each token is. Now I train a big Transformer, alternating attention layers and feed-forward layers; at the end of all the layers I have a new representation, and I simply add a linear classification head on each of the variables: I want to know whether it is plus one or minus one. This is a very simple binary classification on top of the final representation. And again, here I'm just repeating the key point:
association, manipulation, and local operations are basic building blocks of reasoning. So let's do an experiment: let's train a BERT-sized Transformer, which is approximately 10 layers, 10 attention heads, and a thousand dimensions for the embedding; about an order of magnitude less than what I told you leads to the emergence of intelligence, so not incredibly different, but smaller. Again, the sentences will look like this, and each corresponds to a certain chain of reasoning. The sentence here is shuffled: it is not given in the order of the chain of reasoning, that would be too easy.
The different clauses can appear anywhere. You can see the instance as a graph: you start with the root node, and the first edge says a equals plus one, so this edge is labeled with a plus sign, and so on. Now we're going to train on systems of linear equations like this one with 12 variables; this is an arbitrary choice. And one more thing: we are going to index the variables in the order of the chain of reasoning, so in this case variable a is the first variable, b is the second, and so on.
They are indexed not in the order in which they appear in the sentence, but in the order in which they appear in the chain of reasoning, which is the natural order. Now I'll show you a graph of how accuracy increases as you train the Transformer, for each variable indexed by its order in the chain of reasoning. I have 12 variables, indexed 0 to 11. On the x-axis you have the number of training epochs, and the y-axis is test-time accuracy. We're never going to talk about training accuracy; those things always hit 100% training accuracy, that's not the point. We only talk about test time, generalization. And you can see very clearly that they all shoot up to 100%, and you can even see something interesting: first, it's the first variable that shoots up to 100%.
It makes sense: the network first learns to figure out the first variable, then the second, then the third, and so on. For those of you who are paying close attention, you will notice that it is the first variable, the second variable, the third variable, and then, surprisingly, the last variable that starts to increase before the others. This is no accident; we can explain exactly what is happening. I don't have time in this presentation, but you can think about it tonight if you want; it's a really fun exercise, and we understand exactly why it happens.
Anyway, okay, cool: Transformers work, they generalize and everything. But as I was telling you at the beginning, a Transformer could learn to test whether there are two similar objects in an image, and I told you that maybe you have training data only with cats and dogs, but the power of the Transformer is that it will then be able to spot, say, two birds, even though it has never seen any birds. So really the power, the intelligence, is not the classical i.i.d. machine learning generalization; the power is really the power of extrapolation, when you're out of distribution, when you're seeing something you've never seen in training. That's where, in my view, intelligence happens.
So what we're going to do is be a little tricky with the network, and we're only going to provide supervision on the first six variables. Meaning, in a chain of 12 variables, at training time the network only suffers a loss if it is wrong on one of the first six variables; if it is wrong on the later ones, nothing happens. In other words, I am training it to solve systems of linear equations with six variables, but I am going to test it on systems with 12 variables, so I will really test whether it can generalize to more variables than it was supervised on. I'm using this setup so that positional encoding isn't a problem: we don't want different train and test sequence lengths, that would be annoying; we want the same sequence length. That's why I only supervise the first half; the rest of the sentence still appears as input, it just doesn't appear in the loss.
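The supervision scheme amounts to masking the loss, roughly like this (hypothetical names, just to illustrate that the unsupervised variables contribute nothing):

```python
import numpy as np

def masked_loss(per_variable_loss, n_supervised=6):
    """per_variable_loss: (batch, 12) losses, one per variable in
    chain-of-reasoning order. Only the first n_supervised contribute."""
    mask = np.zeros(per_variable_loss.shape[1])
    mask[:n_supervised] = 1.0
    return (per_variable_loss * mask).sum(axis=1).mean()

print(masked_loss(np.ones((4, 12))))  # 6.0: only six variables are counted
```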
Let's see what happens: it doesn't work. This BERT-like network, randomly initialized and trained from scratch, works perfectly for the variables it has seen supervision for in training, but for the next variable, just one more step in the chain, it barely gets to 70% accuracy, and the ones after that barely improve at all. Classically you would say: yeah, of course, how would it know how to solve a 12-variable system when it has only been supervised on six variables? It has no incentive to go the extra mile and solve the seventh variable. So this is totally expected. Now let's do something crazy, which is what anyone in deep learning would suggest.
I mean, you'd be crazy not to do it, which is: train with more data. We're not just going to train on Lego; we're going to train on a lot more data. What the deep learning people would say is: don't train from scratch; first take a pre-trained model, one that has seen a huge amount of text from the Internet, say all of Wikipedia, and has learned to predict the next token on Wikipedia. Take this as your starting point, fine-tune it on Lego, and let's see if it works better. That's what everyone in deep learning would tell you.
A classically trained machine learning person would say the opposite: just train on the task that matters to you; don't confuse the system. So let's do the experiment. Here I frame it as either training on Lego alone or training on Lego plus normal text, but really, for all the experts, and I think all of you are experts: I take a pre-trained model, the original pre-trained BERT, and I fine-tune it on Lego. The answer is that it works amazingly well. The pre-trained network, the model first trained on text and then fine-tuned on Lego, does the extrapolation: one more variable, no problem, it goes to almost 100%; two more variables, to 80%; and so on, and you can actually push all of those to 100% using what is called stochastic depth.
We're not going to talk about that now. The point is really the difference between those two curves: how does adding this diversity of data suddenly make the network actually smarter? In a way, the from-scratch solution was overfitting to Lego. Of course, it's not overfitting in the classical notion; this is test accuracy, not a train-time versus test-time gap. It is overfitting in the sense that the network has developed internal circuits that are fine-tuned to this very special Lego task and that do not represent the general-purpose methodology you would want for solving such systems, whereas the pre-trained model, because it has seen much more data, has been forced to learn much more general-purpose circuits, which extrapolate better. And in fact, this is exactly what we're going to discover. So what's happening? What you can do is scan through the attention heads; you can literally plot them. You feed in a Lego sentence and look at, say, the first layer, the eleventh attention head: what is the attention pattern?
What does it look like? In the pre-trained model, which has never seen a Lego sentence, you see these very structured matrices. What are they? The one with a shifted diagonal is exactly the local manipulation I was telling you about, and the one on the right, I don't have time to analyze it fully, but it is exactly the association: the fact that there are two red dots like this is exactly the fact that variable a, for example, appears twice in the sentence, and the head activates at exactly the two locations of variable a. So what I'm saying on this slide is that pre-training has given rise to attention heads that implement local manipulation, which you can think of as a convolution, and the association pattern, and maybe that's why pre-training works so well: it has learned those general circuits. On the other hand, if you scan through the attention heads of the model that has only seen Lego, you won't see those things; you'll see a lot of noise and mostly unstructured heads.
I don't want to talk too much about it now, but you won't see this structure. Again, these heads are natural for solving systems of linear equations; in fact, they are natural for reasoning more generally. Now, at this point this is a hypothesis. Back to the physics analogy: through the experiment we arrive at a hypothesis, and then we want to actually test it; in general, this is the approach in science. So what we're going to do is forget about training on text; anyway, that was a strange idea.
We have discovered those structures; what happens if we put them into the initialization from the start? We're going to hard-wire an association head and a local manipulation head into a randomly initialized BERT, train it on Lego, and the question is: does it mimic the pre-trained performance? The answer is yes, almost exactly. The model on the right has never seen a Wikipedia article, but it performs basically as well as the pre-trained model. I'm just repeating here: data diversity forces the model to learn these general-purpose circuits, which is what allows it to solve the problem with those tools, and which in turn makes extrapolation possible. This is just part of the Lego paper; we have a lot more in there. In particular, it suggests architectural modifications: maybe we want these association and manipulation heads at initialization more generally.
Another way to look at it is that you can think of this whole story as an implicit regularization that comes from the diversity of data; that's another way of saying it, maybe more in line with classical thinking. Okay, so that's it for the second part, which was about a controlled experiment with a Transformer. Now we're going to switch gears and talk about toy models for emergence, and we'll try to be a little more mathematically grounded, because here we were observing attention patterns and designing experiments to test our hypotheses, but we still have no idea what exactly it is in the training dynamics that leads you to learn the beautiful structure of association and manipulation heads versus the noisy heads you get from training on Lego alone. So what's going on?
Can we try to understand which training dynamics lead to one or the other? This seems very, very difficult. I think it is an achievable goal; I don't know how many years it will take, but it is achievable. Since we want to do something now, though, we are going to try something a little easier: let's talk about the simplest possible case of emergence. In my opinion, the simplest possible case is the fact that in the first layer of a convolutional neural network, edge detectors appear. You see filters, the w's I talked about earlier in the talk, that really try to detect whether there's an edge in this or that orientation in your image, and those are then combined in the next layer to detect, say, faces, and so on. We really want to understand why it happens that when you train a convolutional neural network, say on ImageNet, these highly structured edge detectors emerge.
And again, even this is too difficult, so let's go one step further and simplify it into what I would call a canonical mathematical model, and try to study mathematically what's going on there. The canonical model should capture the essence of those edge detectors, and I think the sparse coding problem does that very, very well. So this is what I'm going to explain now: what this sparse coding problem is, why it is like an edge detector problem, and then we'll analyze how gradient descent on a one-hidden-layer neural network trained on sparse coding learns those edge detectors. You'll see that it's incredibly non-convex; you can't do anything kernel-based, that's completely irrelevant here. Okay, so here is sparse coding. You have a random basis v1 to vd in R^d; these are your edges. Of course, real edge filters are not exactly orthogonal, but the first approximation we make is that we have an orthonormal basis v1 through vd, and just to simplify, so that I have fewer words to say, we'll assume it is the canonical basis, so that looking in the direction of one of those edges means looking at a coordinate. So I have the canonical basis, and it is fixed.
Now, an input example is one basis direction with some intensity, plus white noise. In other words, my model is: y equals sigma times one of the basis vectors, plus Gaussian noise N(0, 1) on every coordinate, so sigma squared is really your signal-to-noise ratio; if sigma is very large, you have a very large peak at one coordinate. And what is your goal? You see many noisy coordinates, but one coordinate is larger than the others, and what you have to output is the value of that coordinate that is larger than the others.
A very, very simple problem. Now, of course, you don't know the basis. This is a key point, let me say it again: you do not know those edges. The point is that the edges must be discovered, they must be learned; this is crucial. So it really looks like white noise processed coordinate-wise, but in a rotated frame, and you don't know the rotation, and we're going to try to learn it. This is the sparse coding problem. Keep in mind that if the signal-to-noise ratio sigma squared is very large, if there really is a huge peak, then the problem is trivial: you can just add up all the coordinates; the sum of the coordinates will be this big peak plus a little bit of noise, so a linear function is a good approximation to the target function. What I'm saying is that if the signal-to-noise ratio is large enough, a linear function solves this problem.
We want to study a regime where linear functions are not good; we want the real problem, where you really need to be non-linear, non-convex. If the signal-to-noise ratio is smaller, you can't just add up the coordinates. What you have to do is look at each coordinate, and if the coordinate is small, set it to zero; you have to remove the noise, you have to threshold. So you can still add up all the coordinates, but only after a non-linear operation has been applied to each one: you want a threshold, you want a filter. What we'll call a threshold unit is a unit that looks at a coordinate and says: if it is small, I replace it with zero, and if it is large, I let it pass. And this is exactly a ReLU with a negative bias: a ReLU unit with a negative bias sets to zero everything below the threshold
and lets everything above it pass. So what we're going to study is the emergence of threshold units, which corresponds to this bias moving toward the negative. Let me say this again: in this low signal-to-noise regime, you can predict correctly with high confidence if you use a one-hidden-layer neural network where the filters w are the basis elements v_i and each neuron has a bias set to exactly minus square root of 2 log d. Why square root of 2 log d? Because if I have d standard Gaussians, their maximum will be almost exactly square root of 2 log d with very high confidence. So what I need to do is set to zero everything in the band between minus square root of 2 log d and square root of 2 log d, let everything outside pass, and this function gets exactly the right answer. This is the target network, the function we are actually trying to learn, and you'll see that it has a positive part and a negative part: one ReLU thresholds whatever is beyond b on the positive side, and the other whatever is beyond minus b on the negative side. These are the two parts, and taking their difference I get exactly the answer I want.
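Here is a minimal sketch of the data model and target network as I've reconstructed them from the talk (the paper's exact normalizations may differ):

```python
import numpy as np

d, sigma = 100, 4.0
rng = np.random.default_rng(0)

def sample():
    """y = sigma * (random signed canonical basis direction) + standard Gaussian noise."""
    i = rng.integers(d)
    y = rng.normal(size=d)
    y[i] += sigma * rng.choice([-1.0, 1.0])   # the signed spike on one coordinate
    return y, i

def target_network(y):
    """Threshold unit on each coordinate with bias b = sqrt(2 log d):
    zero out the band [-b, b] where the noise lives, keep what sticks out."""
    b = np.sqrt(2 * np.log(d))
    return np.sum(np.maximum(y - b, 0) - np.maximum(-y - b, 0))

y, i = sample()
print(y[i], target_network(y))  # the output tracks the part of the spike beyond the band
```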
Now, here is our goal: we would like to learn those edges, those v_i, but we're going to start from a random neural network; we don't know the edge detectors at the start. So we'll have k neurons, and we'll consider the function that is the sum, for l equals 1 to k, of a_l ReLU(w_l · x + b_l), where a_l is just the output coefficient of neuron l. The question is: if you train this network with gradient descent on the sparse coding problem, when is it going to converge to something like the target? When are some of those w_l filters going to learn the right edge detectors, converging to some of the v_i, with the b_l converging to the correct threshold, minus square root of 2 log d? The answer depends on the parameters of the problem, and there are only four: the dimension d, the number of neurons k, the number of samples n in my training data, and sigma
squared, my signal-to-noise ratio. That's it. In my opinion, this is the simplest possible question you could ask in deep learning theory, and we have no idea. We don't know, and I'm not going to give you the answer; we don't know the answer to this question. So we don't know when exactly those edge detectors will be born, but we have some indications that what's going on is really complicated and really difficult, and I'll show you what we have understood, because we have understood something. And I see that maybe I need to speed up a little.
Okay, so let's simplify, because I just told you this problem is too difficult. I'm going to simplify, and you're going to say: wait, wait, now you're oversimplifying. Let's say we already know those edges. I told you the whole point is to learn the edges, but let's say I even give them to you: the v_i are given. What you still don't know are the output coefficients; say you have a single output coefficient for all the positive neurons, a single output coefficient for all the negative neurons, and, most importantly, the bias term b. So now I have reduced it to a three-dimensional problem, in R^3.
We have three parameters, a_plus, a_minus, and b, and we'll just run gradient descent on these three parameters and see what happens. The good thing about being in dimension three is that you can have plots. So we'll have plots like this, with three panels: the training loss over time; the bias, how b changes over time; and the sum of a_plus and a_minus. Here the learning rate is very small, and you see that the training error goes to zero while the bias term does not move. This is not good: the bias term not moving means we do not see the emergence of threshold units; those units are not thresholding anything. The network is detecting whether there's an edge just by splitting at zero between positive and negative, but that's not what we want; we want to threshold outside a certain band. This bias staying at zero means the generalization error is going to be terrible. It means you are really overfitting: you have managed to get a small training error, but the bias has not moved, you did not get a threshold unit, and you're not going to generalize. Okay, let's try a slightly higher learning rate. We see exactly the same thing; maybe the loss gets to zero a little faster, but it still doesn't work. Now here's the crazy thing that happens.
I'm going to increase the learning rate once more, and this is what happens: suddenly, training becomes very unstable, with a lot of oscillation, not only in the parameters but also in the training loss, which is something anyone who has trained a neural network has seen; the loss does not decrease monotonically, there is a lot of variation. But now, suddenly, the bias is also decreasing. So what we are seeing, for this very special problem, is that the emergence of threshold units coincides exactly with the moment at which training becomes unstable, which completely contradicts everything we thought we understood about machine learning.
Typically in machine learning we say that instability is bad: please remove the instability, it will lead to bad generalization. Here I'm saying that instability, large instability, is what actually gives you generalization. And in fact it's even more beautiful than this picture; it's related to the edge of stability. I don't really have time to say what that is; it's a very beautiful phenomenon that was discovered at CMU two years ago, and now there are papers coming out almost every week on arXiv, and ours is one of them. What I want to tell you is that part of this story is even more beautiful than the previous slide: there is a threshold phenomenon, a phase transition, like water; there is a phase transition. So what we'll look at here, in this part of the story, is the learning rate.
What was varying in my previous slides was the learning rate: as the learning rate increases, suddenly you have more instability, and suddenly a threshold unit emerges. So I give you two plots. On the x-axis is the learning rate, which I'm increasing; on the left I'm looking at the bias term, and on the right at generalization, test accuracy. At first the test accuracy is flat, it doesn't improve, and then suddenly, at some level given by this green line (I'm zooming into this region), it starts
You know that you will only get a good generalization if you get the threshold units to emerge, but they happen at exactly the same time and not only that, but it is a threshold phenomenon for a long time nothing happens and then suddenly with the correct profit rate starts to move, not only that, but we were able to prove it and now this is a formal mathematical theorem after two more. approximation steps that I'm sure we can address, but you know it's already complicated enough like this, it's the emergence, you know, this green line this green line is exactly 8 pi over D squared, so this is the moment at 8 pi on D Squared for Sparse Coding with a Single Hidden Layer Neural Network.
This is the emergence of threshold units in terms of learning rate. This is the time when the boundary between a small learning rate and a large execution is okay, so like before, you know, we had uh um, like before. We showed that there was an inductive bias due to the diversity of data. Here we have an inductive bias due to large learning rates. Okay, push.to run this useful structure um uh I started five minutes late Can I take two more minutes to finish this story? Yeah, okay, so this is the last slide and then there's the conclusion, so here I just want to say because at this point, you know, it's an i, it's the point that I usually don't like in conversations where it's very mysterious. , but it's actually very simple.
What is happening is this. Think about the optimization problem we are discussing: remember, there was this output coefficient a and this bias b. It's really as if there is a convex loss L applied to a times some nonlinear function g of b; all the neurons combine with the bias b in such a way that, roughly speaking, the loss looks like L(a · g(b)). It's a little difficult to think about it exactly like that, but generally speaking, that's the intuition. And this function g(b) goes to zero as b becomes very negative, as we move the bias down; this is because with a ReLU, if b goes to minus infinity, you threshold everything away. So let's study that.
I'm going to show you a picture of gradient descent on the simplest possible non-convex problem of this form: I have a convex function L, but instead of applying it to x alone, I apply it to x times g(y). Now I can draw a picture, since I'm in dimension two; x and y are real numbers. Let's see what gradient descent does on a function like L(x · g(y)). If the minimum of L is at zero, then the minima of this function lie along the y-axis and the x-axis, and the minima are not all alike: some are flatter, meaning there's more room where the loss stays small, and as I move up the y-axis or out along the x-axis, the minima get sharper,
because as soon as I come off the axis, the loss increases much faster. So this is the picture, and here is what happens: look at the left side; you have two regimes, the gradient flow regime and the edge-of-stability regime. The gradient flow regime is when you descend with a very small step size, and what happens is, of course, that x and y both decrease, both heading toward zero, but at some point, say, x gets to zero first, and then you're stuck there; you're at a minimum, you're fixed. But what happens if you do gradient descent with a large step size?
You head toward the y-axis, but then you overshoot it a little, and then you start bouncing around, and as you bounce around, y goes down. This bouncing is exactly the inductive bias that makes y go down, and y going down is g(b) going down, which means the bias b goes to minus infinity. So this bouncing is the edge of stability: the fact that you haven't converged, because your step size is too large, is exactly what gives you an inductive bias driven by the learning rate. And let me tell you, this picture is our real experiment; it's not a cartoon. This is actually what happens, and we can even predict exactly where you're going to end up on the y-axis.
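To see the mechanism concretely, here is a tiny simulation of gradient descent on f(x, y) = ½(x·e^y)², i.e. L(x·g(y)) with L(z) = z²/2 and g(y) = e^y; the choice of L and g is mine, the talk doesn't specify one. With a small step size, y barely moves; with a large step size, x bounces across the valley and the bouncing itself drives y down toward the flatter region:

```python
import numpy as np

def run_gd(eta, x=0.2, y=0.0, steps=2000):
    """Gradient descent on f(x, y) = 0.5 * (x * exp(y))**2.
    Minima lie where x * exp(y) = 0; smaller y means a flatter valley."""
    for _ in range(steps):
        g2 = np.exp(2 * y)                                 # g(y)**2 = exp(2y)
        # df/dx = x * exp(2y), df/dy = x**2 * exp(2y); simultaneous update
        x, y = x * (1 - eta * g2), y - eta * x**2 * g2
    return x, y

for eta in (0.1, 2.5):
    x, y = run_gd(eta)
    print(f"eta={eta}: x={x:+.4f}, y={y:+.4f}")
# Small eta: y barely moves (about -0.02). Large eta: x bounces across the
# valley with sign flips and the oscillation drives y down (about -0.44),
# toward the flatter region -- the same inductive bias that pushes the bias
# term of the threshold unit toward negative values.
```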
So with that, let me conclude; I don't know if we'll have time for questions, hopefully, but maybe not. This sentence is the most important one in the presentation, I think, even though I can't fully explain it: a miracle has really happened, intelligence has emerged. You can see it in ChatGPT; it is changing the world. What we propose is this physics-inspired methodology, based on controlled experiments and toy mathematical models, and we've seen two things. In Lego, the diversity of data leads to an inductive bias toward a useful structure; this contradicts traditional i.i.d. train-and-test machine learning. I'm saying you shouldn't have the same distribution in training and testing; you should have a different, more complicated training distribution. Then we saw an example of a toy mathematical model, the sparse coding analysis, and there we saw that instability produces an inductive bias toward the emergence of threshold units; again this contradicts the usual theory, where instability is bad. And really the key point is that I can't quite believe these experiments.
I mean, all of this seems too good to be true, in my opinion, even this inductive bias due to oscillation. There should be a more general theory, more general principles at play; it seems like too good a coincidence. But I certainly can't see the more general principle right now, and I think the problem is that I don't have enough evidence, enough controlled experiments, enough toy environments where we understand exactly what is happening, to build this more general picture. So let's do it. Thank you.
