Geoffrey Hinton Unpacks The Forward-Forward Algorithm

Mar 14, 2024
When you see a pink elephant, notice that the words "pink" and "elephant" refer to things in the world, but what is really happening is something going on inside your head. Hi, I'm Craig Smith, and this is Eye on AI. Geoffrey Hinton, a pioneer of neural networks and the man who coined the term deep learning, has been driven throughout his career to understand the brain. His application of the error backpropagation algorithm to deep networks triggered a revolution in artificial intelligence, but he doesn't think it explains how the brain processes information. Late last year he introduced a new learning algorithm he calls the forward-forward algorithm, which he believes is a more plausible model for how the cerebral cortex might learn.
A lot has been written about the forward-forward algorithm in recent weeks, but here Geoff gives us a deep dive into the algorithm and the journey that led to it. The conversation is technical and assumes a lot of knowledge on the part of listeners, but my advice to those who don't have that knowledge is to let the technical stuff wash over you and listen to Geoff's ideas. Before you begin, I'd like to mention our sponsor ClearML, an end-to-end open source MLOps solution. You can try it for free at clear.ml, that's c-l-e-a-r dot ml. And now, here's Geoff.

I hope you find the conversation as fascinating as I do. To start, can you explain forward-forward networks to listeners, and why you're looking for something beyond backpropagation despite its tremendous success? Let me start by explaining why I don't believe the brain is doing backpropagation. One thing about backpropagation is that you need to have a perfect model of the forward system in order to do the backward propagation. It's easiest to think about for a layered network, but it also works for recurrent networks. For a layered network you do a forward pass where the input comes in at the bottom and goes through the layers, so the input can be pixels and what comes out at the top can be a classification of whether it's a cat or a dog.
You move through the layers and then look at the error in the output. If you say cat when you should have said dog, that's wrong, and you'd like to figure out how to change all the weights in the forward pass so that next time you're more likely to say the correct category instead of the wrong one. So you have to figure out how a change in a weight would affect how likely you are to give the correct answer, and then you change all the weights in proportion to how much each change helps you get the correct answer. Backpropagation is a way of calculating that gradient: you're calculating how much a change in a weight would make the system have less error, and then you change the weight in proportion to how much it helps, and obviously, if it hurts, you change it in the opposite direction. Now, backpropagation looks like the forward pass run in reverse: it goes backwards.
It has to use the same connectivity pattern with the same weights but in the backward direction, and it has to go backwards through the nonlinearity of the neuron. There's no evidence the brain is doing that, and there are many reasons to think it isn't. The worst case is if you're doing backpropagation in a recurrent network, because then you run the recurrent network forward in time, it generates a response at the end of the forward run, and then you have to run it backwards in time to get all the derivatives you need to change the weights. That's particularly problematic if, for example, you're trying to process video: you can't stop and go back in time. Combined with the fact that there's no good evidence the brain does it, there's the problem that, purely as technology, it's a disaster: it disrupts the pipelining of things. For something like video there are multiple stages of processing, and you'd like to just keep funneling the inputs through those stages. So the idea of the forward-forward algorithm is that you split the learning, the process of getting the gradients you need, into two separate phases. You can do one of them online and one offline, and the one you do online can be very simple, which allows you to just pipeline things. In the online phase, which corresponds to being awake, the input comes into the network, and in the recurrent version the input keeps coming into the network, and what you're trying to do for each layer at each time step is to make the layer have high activity, or rather, high enough activity that you can tell that this is real data.
So the underlying idea is that for real data you want each layer to have high activity, and for fake data, which we'll get to, you'd like each layer to have low activity. The task of the network, what you're trying to achieve, is not to give the correct label, which is the property backpropagation tries to achieve, but to be able to differentiate between real data and fake data at each layer: each layer has high activity for real data and low activity for fake data, so each layer has its own objective function. In fact, to be more precise, we take the sum of the squares of the activities of the units in a layer, subtract some threshold, and then feed that to a logistic function that simply decides the probability that this is real data rather than fake data. If the logistic function gets a big input, it will say it's definitely real data, so there's no need to change anything: you won't learn on that example because you're already getting it right, and that explains how you can run a lot of positive examples without running any negative examples, which is the fake data, because you'll just get saturated on positive examples.
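The per-layer objective just described, the sum of squared activities minus a threshold fed through a logistic function, can be sketched in a few lines (a minimal illustration; the function name and threshold value are my own choices):

```python
import numpy as np

def goodness_prob(layer_activities, threshold):
    """Probability that the current input is real data, judged from
    one layer alone: sum of squared activities minus a threshold,
    squashed through a logistic function."""
    goodness = np.sum(np.asarray(layer_activities) ** 2) - threshold
    return 1.0 / (1.0 + np.exp(-goodness))  # logistic sigmoid
```

A layer with strong activity, goodness well above the threshold, saturates toward 1, which is why, as Hinton notes, confidently classified positive examples produce almost no weight change.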
So that's what it does in the positive phase: you're trying to get a high sum of squared activities in each layer, high enough that you can tell it's real data. In the negative phase, running offline, that is, during sleep, the network needs to generate its own data, and with its own data as input you want to have low activity in each layer. So the network has to learn a generative model, and what you're trying to do is discriminate between real data and the fake data produced by your own generative model. Obviously, if you can't discriminate at all, then the derivatives you get from the real data and the derivatives you get from the fake data will be equal and opposite, so you won't learn anything; learning is over if you can't distinguish between what it generates and the real data. This is very similar, if you know about generative adversarial networks, except that the discriminative network that tries to tell real from fake and the generative model that tries to generate fake data use the same hidden units, and so they use the same hidden representations. That overcomes many of the problems a GAN has. On the other hand, because it doesn't use backpropagation to learn the generative model, it's harder to learn a good generative model. That's a rough general description of the algorithm.
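A single layer trained with this kind of local objective might look like the following sketch. The ReLU units, the learning rate, and the exact update rule are my own simplifications of the idea (raise goodness on positive data, lower it on negative data), not Hinton's precise recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

class FFLayer:
    """One layer with a purely local forward-forward-style objective."""

    def __init__(self, n_in, n_out, threshold=2.0, lr=0.03):
        self.W = rng.normal(0.0, 0.1, (n_in, n_out))
        self.threshold = threshold
        self.lr = lr

    def forward(self, x):
        return np.maximum(0.0, x @ self.W)  # ReLU activities

    def train_step(self, x, positive):
        """Gradient step on the log-likelihood that the layer classifies
        x correctly as real (positive=True) or fake (positive=False)."""
        h = self.forward(x)
        goodness = np.sum(h ** 2) - self.threshold
        p_real = 1.0 / (1.0 + np.exp(-goodness))
        # Push goodness up for real data, down for fake data.
        coeff = (1.0 - p_real) if positive else -p_real
        self.W += self.lr * coeff * 2.0 * np.outer(x, h)
        return p_real
```

Because the update uses only the layer's own inputs and activities, layers can be trained without any backward pass through the rest of the network.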
Let me ask you a couple of questions about the sleep-wake cycle: are you rapidly alternating between the phases? Okay, so in most of the preliminary research I would cycle rapidly between them, because that's the most obvious thing to do, and I discovered later, although it had been known for some time in contrastive learning, that you can separate the phases, and I found that it worked quite well to separate them. In recent experiments I've done with character prediction, you can have it predict about a quarter of a million characters. So it's running on real data trying to predict the next character, making predictions, and it's running with mini-batches, so after making a number of predictions it updates the weights, and then it sees more positive examples and updates again. In all of those phases you're simply trying to get higher activity in the hidden layers, but only if you don't already have high activity. It can predict, say, a quarter of a million characters in the positive phase, and then you switch to the negative phase, where the network generates its own string of characters, and now you're trying to get low activity in the hidden layers for the little windows of characters it predicts, and you run through a quarter of a million characters like that. It doesn't really have to be the same number: in Boltzmann machines it was very important to have the same number of things in the positive phase and the negative phase, but with this, the most remarkable thing is that even a few hundred thousand predictions in each phase works almost as well when you separate the phases as when you interleave them, and that's pretty amazing. In human learning, we certainly sleep, and there are complicated concepts you're learning, but there's learning going on all the time that doesn't require a sleep phase. Well, there is in this too.
If you're only running positive examples, you're changing the weights on all the examples where it's not already completely obvious that this is positive data, so it will do a lot of good and you'll learn a lot in the positive phase. But if you keep it up for too long, you fail catastrophically, and people seem to be the same: if you deprive someone of sleep for a week, they'll go completely psychotic and have hallucinations, and they may never recover. Can you explain? I think one thing that people, non-practitioners, are having trouble understanding is the concept of negative data.
I've seen some articles where they just put it in quotes, indicating that they don't understand it. Well, what I mean by negative data is the data you give the system when it's running in the negative phase, that is, when you're trying to get low activity in all the hidden layers, and there are many ways to generate negative data. In the end you would like the model itself to generate the negative data. This is just like in Boltzmann machines: the data the model itself generates is the negative data, and what you're trying to model is the real data, and once you have a really good model, the negative data looks just like the real data, so no learning takes place. But the negative data doesn't have to be produced by the model. For example, you can train it to do supervised learning by putting in both an image and a label, so now the label is part of the input, not part of the output. When you put in an image with the correct label, that's positive data and you want high activity; when you put in an image with an incorrect label, chosen by hand, that's negative data. Now, it works better if you have the model predict the label and put in the best of the model's incorrect predictions, because then you're giving it the things it's most likely to get wrong as negative data.
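The supervised setup just described, with the label made part of the input, can be sketched as follows. In the forward-forward paper Hinton overlays a one-hot label on the first few pixels of an MNIST image; the helper names here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def overlay_label(image_vec, label, n_classes=10):
    """Make the label part of the input by overwriting the first
    n_classes entries with a one-hot encoding of the label."""
    x = np.array(image_vec, dtype=float)
    x[:n_classes] = 0.0
    x[label] = 1.0
    return x

def make_pos_neg(image_vec, true_label, n_classes=10):
    """Positive example: the image with its correct label overlaid.
    Negative example: the same image with a randomly chosen wrong
    label (hand-made negative data, as described above)."""
    pos = overlay_label(image_vec, true_label, n_classes)
    wrong = rng.choice([c for c in range(n_classes) if c != true_label])
    neg = overlay_label(image_vec, wrong, n_classes)
    return pos, neg
```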
But you can put in hand-chosen negative data and it works fine. And the reconciliation at the end is like in Boltzmann machines, where you're subtracting the negative data from the positive data? Well, in Boltzmann machines what you do is you give it positive data, real data, and you let it settle to equilibrium, which has nothing to do with the forward-forward algorithm, well, not exactly anyway. Once it's settled to equilibrium, you measure the pairwise statistics, which is how often two units that are connected are on together, and then in the negative phase you do the same thing: you just let the model settle as it produces its own data and you measure the same statistics, and you take the difference of those pairwise statistics, and that's the correct learning signal for a Boltzmann machine. But the problem is that you have to let the model settle, and there's just no time for that. Plus you have to have all kinds of other conditions, like the connections having to be symmetrical.
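The Boltzmann machine learning signal being contrasted here, the difference between pairwise co-activation statistics in the clamped and free-running phases, looks roughly like this (a sketch only; it omits the slow settling-to-equilibrium sampling that Hinton says there is no time for):

```python
import numpy as np

def boltzmann_weight_update(pos_states, neg_states, lr=0.01):
    """Boltzmann machine weight change from equilibrium samples.

    pos_states: (n_samples, n_units) binary unit states collected
        with the data clamped (positive phase).
    neg_states: states collected while the model free-runs
        (negative phase).
    Returns lr * (<s_i s_j>_data - <s_i s_j>_model), pairwise.
    """
    pos_stats = pos_states.T @ pos_states / len(pos_states)
    neg_stats = neg_states.T @ neg_states / len(neg_states)
    return lr * (pos_stats - neg_stats)
```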
There's no evidence that connections in the brain are symmetrical. Can you give a concrete example of positive and negative data in a very simple learning exercise? I was working with digits, but in this example, if you're predicting a string of characters, the positive data would be a little window of characters, and you'd have some hidden layers, and because it's a positive window of characters, you try to make the activity high in all the hidden layers. But also, from the activity in those hidden layers, you try to predict the next character. It's a very simple generative model, but notice that the generative model doesn't have to learn its own representations: the representations are learned just to make positive strings of characters give you high activity in all the hidden layers. That's the goal of the learning; the goal is not to predict the next character. But once you've learned the right representations for these character windows, you also learn to predict the next character, and that's what you're doing in the positive phase: looking at the character windows, you're changing the weights so that all the hidden layers have high activity for those windows of characters, but you're also changing the weights that try to predict the next character from the activity in the hidden layers, which is what's sometimes called a linear classifier. So that's the positive phase. In the negative phase, as input you use characters that have already been predicted, so you have this window and you keep predicting the next character, and then you move the window along by one to include the character you just predicted
and to drop the oldest character, and you just continue like that. For each of those frames you try to get low activity in the hidden layers, because it's negative data. And I think you can see that if your predictions were perfect and you start from a real string, then what happens in the negative phase will be exactly the same as what happens in the positive phase, so the two will cancel out. But if there's a difference, then you'll learn to make things more like the positive phase and less like the negative phase, so it will get better and better at predicting.
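The negative phase just described, rolling the model forward on its own predictions with a sliding window, can be sketched like this; `model_predict` is a hypothetical stand-in for whatever predictor sits on top of the hidden layers:

```python
def negative_windows(model_predict, seed_window, n_steps):
    """Generate negative-phase character windows: append the model's
    own predicted character and drop the oldest, one window per step."""
    window = list(seed_window)
    for _ in range(n_steps):
        next_char = model_predict(window)  # the model's own prediction
        window = window[1:] + [next_char]
        yield list(window)
```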
I understood that with backpropagation on static data there are inputs and there's an output, and you calculate the error, then run backwards through the network and correct the weights, and then you do it again, and that's not a good model for the brain because there's no evidence that information flows backwards through neurons. That's not exactly the right way to put it: there's no good evidence that these error derivatives, these error gradients, flow backwards. Obviously the brain has top-down connections. If you look at the perceptual system, there's a sort of forward direction that goes from the thalamus up to the inferior temporal cortex, where you recognize things, and the thalamus is sort of the place where the information from the eyes arrives. There are connections in the backward direction, but the connections in the backward direction are nothing like what you would need for backpropagation. For example, between two cortical areas the return connections don't go to the same cells that the forward connections come from; it's not reciprocal in that sense. Yes, there's a loop between cortical areas, but the information in one cortical area passes through about six different neurons before it gets back to where it started, and then it's a loop; it's not a mirrored system. Okay, but my question is: you're talking about turning the static image into a boring video, and that allows you to have top-down effects? That's correct, yes.
So you have to think of a forward direction that goes from the bottom layers to the top, and then orthogonal to that is the time dimension. So if I have a video, even if it's a video of a single thing standing still, I can go up and down through the layers as I go through time, and that's what allows you to have top-down effects. Okay, so each layer can receive input from a higher layer at the previous time step? Exactly, yes. So what a layer is doing is receiving input from higher and lower layers at the previous time step, and from itself at the previous time step, and if you have a static input, the whole process over time looks like a network settling, which looks a bit more like a Boltzmann machine settling. The idea is that the time you're using for that is the same time you're using for processing videos, and because of that, if I give you input that's changing too fast, you'll never be able to settle on it. So I discovered this nice phenomenon.
If you take an irregularly shaped object, like a potato, a nice irregularly shaped potato, and you throw it into the air, slowly rotating at one or two revolutions per second, you can't see what shape it is. You just can't see its shape: you don't have time to settle on a 3D interpretation, because the time steps you'd use for settling on a static image are the same time steps you're using for processing video. And what I found fascinating, and maybe this is something that's already in the literature, is this idea of going up and down through the layers as you go through time. There have always been recurrent networks, but to start with, recurrent networks had only one hidden layer, so typical LSTMs and such would have one hidden layer, and then Alex Graves had the idea of having multiple hidden layers and he showed it was a winner. So that idea has been around, but it's always been combined with backpropagation as the learning algorithm, and in that case it was backpropagation through time, which is completely unrealistic biologically. But real life for the brain is not static, so you're not perceiving in a truly static way. How much did this arise from SimCLR-style contrastive learning, or from activity differences, NGRADs, a couple of years ago?
I got really excited a couple of years ago because I was trying to make a more biologically plausible version of things like SimCLR. Of course, there are lots of similar things, and SimCLR just wasn't the first of them; in fact, Sue Becker and I published something a little simpler around 1992 in Nature, but we didn't use negative examples, we tried to compute the negative phase analytically, and that was a mistake, that would just never work. Once you start using negative examples you get things like SimCLR, and I discovered that the phases could be separated, which it had seemed they couldn't be, and that excited me a lot a few years ago because it seemed I finally had an explanation of what sleep is for. The big difference: SimCLR simply takes two different patches of the same image, and if they're from the same image, it tries to make them have similar representations.
If they're from different images, it tries to make them have different representations, different enough, and once they're different enough it doesn't try to make them more different. And when you think about it, this simply involves looking at two representations and seeing how similar they are, and that's one way of measuring agreement. In fact, if you think about the squared difference between two vectors, it decomposes into three terms: one to do with the square of the first vector, one to do with the square of the second vector, and then the dot product of the two vectors, and the dot product of the two vectors is the only interactive term. So it turns out that the squared difference is very similar to a dot product: a large squared difference is like a small dot product.
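The three-term decomposition he mentions is the standard identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which is why a large squared difference corresponds to a small dot product for vectors of roughly fixed length. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
lhs = np.sum((a - b) ** 2)
rhs = np.sum(a ** 2) + np.sum(b ** 2) - 2.0 * np.dot(a, b)
assert np.isclose(lhs, rhs)  # the dot product is the only interaction term
```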
Now there's a different way to measure agreement, which is to take the things you'd like to agree and send them into one set of neurons, and if two sources coming into that set of neurons agree, you'll get high activity in those neurons; it's like positive interference between light waves. If they don't agree, you'll get low activity. And if you measure agreement just by the activity in a layer of neurons, you're measuring agreement between the inputs, so you don't have to have exactly two things, you can have as many things as you want; you don't have to divide the input into two patches and ask whether the representations of the two patches agree.
You can simply say: I have a hidden layer; does this hidden layer become very active? I think that's a better way to measure agreement; it's easier for the brain to do. And it's particularly interesting if you have spiking neurons, because what I'm currently using doesn't use spike timing: it just says a hidden layer in effect asks, do my inputs agree with each other, in which case I'll be very active, or don't they, in which case I won't. But if the inputs come at specific, very precise times, like spikes do, then you can ask not only whether other neurons are being stimulated, but whether they're being stimulated at exactly the same time, and that's a much more accurate way of measuring agreement. So spiking neurons seem particularly good at measuring agreement, which is what I need: agreement is the objective function to achieve in the positive phase, not in the negative phase. And I'm thinking of ways to try to use spiking neurons to make this work better, but that's a big problem.
The difference from SimCLR and the like is that you're not taking two things and asking whether they agree; you're simply taking all the inputs that come into a layer and asking whether all those inputs agree. When you talk about that activity, is it similar to what you were doing with NGRADs, where you compare top-down predictions and bottom-up predictions? Okay, so when you do the recurrent version of the forward-forward algorithm, at every time step the neurons in a layer get top-down input and bottom-up input, and you'd like them to agree, and since their objective function is to have high activity, agreement is what makes things highly active.
There's another version of the forward-forward algorithm where the goal is to have low activity, and then you want the top-down input to cancel out the bottom-up input, and then it looks much more like predictive coding; it's not exactly the same but it's very similar. But let's stick with the version where you're looking for high activity: you want top-down and bottom-up to agree and give you high activity. Notice, though, that the top-down input isn't a derivative. In attempts to implement backpropagation in neural networks, you try to have top-down things that are like derivatives and bottom-up things that are like activities, and you try to use temporal differences to give you the derivatives. Here it's something different: everything is activities, and derivatives are never propagated. Does this algorithm also eliminate the idea of dynamic routing that you talked about with stacked capsule autoencoders? Yes, yes. With capsules I went from dynamic routing to having what are called universal capsules. A capsule would be a small collection of neurons, and in the original capsule models that collection of neurons could only represent one type of thing, like a nose, and a different type of capsule would represent a mouth. With universal capsules, each capsule can represent any type of thing, so you have different patterns of activity to represent the different types of things that could be there.
A capsule is dedicated to a location in the image, so a capsule represents what kind of thing you have in that place at a particular level of the part-whole hierarchy: one capsule might represent that at the part level you have a nose, and then at a higher level you'd have other capsules representing that at the object level you have a face, or something like that. But once you get rid of dedicating a group of neurons to a particular type of thing, you no longer need to do routing, and in the forward-forward algorithm I'm not doing routing. One of the diagrams in the forward-forward paper is actually taken from my paper on part-whole hierarchies, my last paper on capsule models. I had a system there called GLOM, an imaginary system, and the problem with it was that I never had a plausible learning algorithm for it, and the forward-forward algorithm is a plausible learning algorithm for GLOM; it's something that's neurally reasonable. What was fascinating to me, at least about capsules, is that they captured the 3D nature of reality. Many neural networks are doing that now: NeRF models, neural radiance fields, give you very good 3D models in neural networks, so you can see something from different points of view and then produce an image of what it would look like from a new point of view, which is very good, for example, for making smooth videos from frames taken at quite long time intervals. But in the forward-forward algorithm, what is your intuition?
If it really all works, this is a model for information processing in the cerebral cortex, and depth perception and the three-dimensional nature of reality would emerge? Yes, yes. In particular, if I'm showing you a video and the point of view changes during the video, then what you would want is for the hidden layers to represent the 3D structure. That's quite pie in the sky right now; there's a long way to go to get to that stage, but yes. But with capsules, I think you meant that pixels have depth, so that if one object moved in front of another, the system understood that it was behind what was in front of it.
Do you capture that with the forward-forward algorithm? You'd want it to learn to deal with that. Yeah, I wouldn't build that in, but it's an obvious thing that should be learned. With babies, they learn in just a few days to get structure from motion, that is, what happens if I take a static scene and move the observer, or hold the observer stationary. The experiments were done with a sheet of paper folded into a W shape, and if you see it inverted in depth it looks strange. The experiments done by Elizabeth Spelke and other people use the idea that you can tell a lot about a baby's perception by seeing what interests it, because babies are interested in things that seem strange, so they'll pay more attention to things that seem impossible. Within a few days they learn how 3D structure should be related to motion, and if you relate it wrongly, they think it's strange, so they learn that very quickly, whereas it takes them at least six months.
I think learning to do stereo, to get depth from the two eyes, takes much longer. It's much easier to get 3D structure from video than from stereo, and from an evolutionary point of view, if something is really easy to learn, there isn't much point in wiring it in. You've been working, famously, in Matlab on toy problems. Are you starting to scale up, or are you still refining? I'm doing a bit of scaling; I'm using a GPU to make these things go a bit faster, but I'm still at the stage where there are very basic properties of the algorithm I'm exploring, in particular how to effectively generate negative data from the model, and I want to have the basic kinds of things working very well first.
I think it's silly to scale up too soon: as soon as you scale up, it's slower to investigate changes to the basic algorithm, and I'm still at the stage where there are a ton of different things I want to investigate. For example, here's just one small thing I haven't had time to investigate yet. You can use as an objective function having high activity in the positive phase and low activity in the negative phase, and if you do that, you'll find interesting features in the hidden units. Or you can have the mirror-image objective function: to have low activity in the positive phase.
If you do that, you'll find interesting constraints. Think about how physicists try to understand nature: by finding seemingly different things that add up to zero, or, another way of saying it, that are equal and opposite. If you take the force and subtract the mass times the acceleration, you get zero, and that's a constraint. So if you have two kinds of input, one of which is force and the other of which is mass times acceleration, you'd like to have hidden units that see both inputs and say zero, no activity, and then when they see things that don't conform to physics, they'll have high activity: they'll be negative data. That's called a constraint, and so if you make your objective function be low activity for real things and high activity for things that aren't real, you'll find constraints in the data instead of features. Features are things that have high variance and constraints are things that have low variance: a feature is a direction with higher variance than it should have, and a constraint one with lower variance than it should have. Now, there's no reason you shouldn't have two kinds of neurons, one looking for features and one looking for constraints, and we know with just linear models that a method like principal component analysis looks for the directions in space with the highest variance, which are like features, and it's very stable. There are other methods, like minor component analysis, that look for the directions in space with the lowest variance; they're looking for constraints, and they're less numerically stable, but we know it's worth having both. So that, for example, is a direction that could make things work better, but there are many things like that.
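The features-versus-constraints distinction can be illustrated with a linear toy example (my own construction): principal components pick out the highest-variance directions (features), while minor components pick out the lowest-variance directions (constraints). Here the data obeys the constraint x1 = 2*x0 up to noise, and the minor component recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data obeying a near-exact linear constraint: x1 = 2*x0 + small noise.
x0 = rng.normal(size=1000)
X = np.column_stack([x0, 2.0 * x0 + 0.01 * rng.normal(size=1000)])

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

minor = eigvecs[:, 0]       # lowest-variance direction: a constraint
principal = eigvecs[:, -1]  # highest-variance direction: a feature

# The minor component is proportional to (2, -1): it encodes 2*x0 - x1 = 0.
ratio = minor[1] / minor[0]
```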
There are maybe twenty things like that I need to investigate, and my feeling is that until I have a good recipe for whether to use features or constraints or both, and for what the most effective way to generate negative data is, and so on, it's premature to investigate really large systems. With respect to really large systems, one of the things you talk about is the need for a new type of computer, and I've seen confusion about this in the press as well.
I've seen people talk as if the idea were to get rid of conventional digital computers entirely. Obviously you want computers where the hardware and the software are separate for things like keeping track of your bank account. This is for things where we want computers to be like people: processing natural language, processing vision, all those things that, as Bill Gates said a few years ago, computers couldn't do, as if they were blind and deaf. They're no longer blind and deaf, but to process natural language, control motors, or reason with common sense, we probably want a different type of computer, and if we want very low power consumption, we need to make much better use of all the properties of the hardware.
My main interest is understanding the brain; I have a secondary interest in getting low-energy computing going. The point about the forward-forward algorithm is that it works when you don't have a good model of the hardware. So if, for example, I take a neural network and insert a black box, a layer that is just a black box, and I have no idea how it works, it does stochastic things, I don't know what's going on, the question is: can the whole system learn with that black box in there? With forward-forward there's absolutely no problem. The black box changes what happens on the forward pass, but the point is that it changes it in exactly the same way for both forward passes, so everything cancels out. Whereas with backpropagation you're completely sunk with this black box: the best you can do is try to learn a differentiable model of the black box, and that won't be very good
If the black box wanders in its behavior, then the forward algorithm does not need to have a perfect model of a forward system, it needs to have a good enough model of what a neuron is doing so that it can change the incoming weights of that neuron to make it more active or less active, but that's all you need, you don't need to be able to reverse the forward pass and we're not talking about replacing back propagation, which has obviously been hugely successful, there's a lot of computing, a lot of power. then the subsequent clipping is fine, but this is speculative.
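The black-box point can be sketched concretely: each layer is trained with a purely local objective — high "goodness" (sum of squared activities, as in the paper) on positive data, low goodness on negative data — so a non-differentiable, even stochastic, stage before the layer never needs to be differentiated through. In this sketch the logistic objective shape, the threshold, the learning rate, and the black-box function itself are all illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def black_box(x):
    # Non-differentiable, stochastic stage. The layer after it can still
    # learn, because the same box is applied on both the positive and the
    # negative forward passes, so its effect "cancels out" for learning.
    return np.sign(x) * np.sqrt(np.abs(x)) + 0.01 * rng.standard_normal(x.shape)

class FFLayer:
    def __init__(self, n_in, n_out, theta=2.0, lr=0.03):
        self.W = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
        self.theta, self.lr = theta, lr  # goodness threshold, step size

    def forward(self, x):
        return np.maximum(x @ self.W, 0.0)  # ReLU activities

    def update(self, x, positive):
        h = self.forward(x)
        goodness = (h ** 2).sum(axis=1)  # layer-local "goodness"
        # Logistic pressure: push goodness above theta on positive data,
        # below theta on negative data. Gradient of log-sigmoid w.r.t. W
        # (for ReLU units, d goodness/dW_ij = 2 * h_j * x_i where h_j > 0).
        sign = 1.0 if positive else -1.0
        p = 1.0 / (1.0 + np.exp(-sign * (goodness - self.theta)))
        grad = x.T @ (sign * (1.0 - p)[:, None] * 2.0 * h)
        self.W += self.lr * grad / len(x)
        return h

layer = FFLayer(16, 32)
pos = rng.standard_normal((8, 16)) + 1.0   # stand-in positive batch
neg = rng.standard_normal((8, 16)) - 1.0   # stand-in negative batch
h_pos = layer.update(black_box(pos), positive=True)
h_neg = layer.update(black_box(neg), positive=False)
```

One detail omitted here: in the actual algorithm each layer's output is length-normalized before being fed to the next layer, so later layers can't trivially read off the previous layer's goodness.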
I understand where you are in the research, but can you imagine having a low-power computer architecture that could run the forward-forward algorithm and scale it up? Imagine it; that would be great. In fact, I've been talking to someone named Jack Kendall, who works for a company called Rain and who is very insightful about what can be done with analog hardware using the natural properties of electrical circuits. Initially that was very interesting for doing a form of Boltzmann machine learning, but it's also going to be very interesting for the forward-forward algorithm, so I can imagine it scaling up very well — though there's a lot of work to do to make that happen. And if it scaled up very well, to the extent that large language models have been successful, do you think its capabilities would eclipse those of models based on backpropagation?
I'm not entirely sure; I think possibly not. I think backpropagation might be a better algorithm in the sense that for a given number of connections you can get more knowledge into those connections using backpropagation than with the forward-forward algorithm, so forward-forward does better if the networks are somewhat larger than the best-sized networks for backpropagation. It's not good at squeezing a lot of information into a few connections. Backpropagation will squeeze a lot of information into a few connections if you force it to — it's much happier not having to do that, but it will if you force it — and the forward-forward algorithm isn't good at that. So if you take these big language models — take something with a trillion connections, which is about the largest language model — that kind of size is about a cubic centimeter of cortex, and we have something like a thousand times that much cortex. These big language models actually know a lot more facts than you do, because they've read everything on the web — not everything, but a lot. The sense in which they know them is a little iffy, but if you had a kind of general knowledge quiz, I think GPT-3 would even beat me on a general knowledge quiz.
It would know all kinds of people, when they were born and what they did, that I don't know, and it all fits into a cubic centimeter of cortex if you measure it by connections. It has much more knowledge than I do with much less brain, so I think backpropagation is much better at squeezing information in. But that's not the main problem with the brain: big brains have many synapses, and the question is how to provide information to them effectively — how do you put experience to good use? David Chalmers has talked about the possibility of consciousness, and certainly you're interested in the possibility, if you understand how the brain works and can replicate it in this type of model.
Imagine it scales beautifully: do you see the potential for reasoning? Oh, I see the potential for reasoning, sure, but consciousness is a different kind of question. It surprises me that people think they understand what they're talking about when they talk about consciousness. They talk as if we can define it, when it's really a mixture of a bunch of different concepts, all mixed up in an attempt to explain a really complicated mechanism in terms of an essence. We've seen that before. A hundred years ago, if you asked philosophers what makes something alive — or even if you asked biologists — they'd say, well, it has life force. But if you ask what the life force is, and whether we can make machines have life force, they can't really define it other than to say it's what makes things live. As soon as you start understanding biochemistry, you abandon the notion of life force: you understand biochemical processes that are stable and things that break them down. It's not that we stopped having life force — we have as much life force as we had before — it's just not a useful concept, because it's an attempt to explain something complicated in terms of a simple essence. Another concept like that is sports cars having oomph. Some have a lot of it, like an Aston Martin with big loud exhausts, lots of acceleration, and bucket seats — it has a lot of oomph. Oomph is an intuitive concept: you may ask, doesn't an Aston Martin have more oomph than my Toyota Corolla? It definitely has a lot more oomph. So do we really need to figure out what oomph is, because oomph is what it's all about if you're interested in fast cars? The concept of oomph is a perfectly good concept, but it doesn't really explain much. If you want to know why the car goes very fast when I press the accelerator, the concept of oomph isn't going to help you; you have to get into the mechanics of how it actually works. And that's a good analogy, because what I was going to say is that it doesn't really matter what consciousness is; it matters whether we as humans perceive that something has consciousness — and I think there's a lot to that, yes. If forward-forward leads to a large model that scales with relatively low power consumption and can reason, there will always be philosophers who will say, yes, but it's not conscious — but it doesn't really matter if you can't tell the difference.
I think it would be good to show philosophers the way out of the trap they've set for themselves. I think most people have a radical misunderstanding of how the terms perception, experience, sensation, and feelings really work — how the language works. If, for example, I say I'm seeing a pink elephant, notice that the words pink and elephant refer to things in the world. What's really happening is that I would like to tell you what's going on inside my head — yes — but telling you what my neurons are doing won't do you much good, especially since all our brains are wired slightly differently; it's just no use telling you what the neurons are doing. What I can tell you is that whatever my neurons are doing, it's the kind of thing that's usually caused by a pink elephant being out there.
If I were having a veridical perception, the cause of my brain state would be a pink elephant. I can tell you that, and it doesn't mean there's a pink elephant in some spooky theater inside my head, or that it's just a mental thing. What it really tells you is that I'm giving you a counterfactual: I'm saying the world doesn't actually contain a pink elephant, but if it did, that would explain my brain state by normal perceptual causality. So when I say I'm having the experience of a pink elephant, many people think the word experience refers to some funny internal event — that an experience is something internal. That's not what I denote when I use the word experience; it's not something real. I'm giving you a hypothetical: if this hypothetical thing were in the world, it would explain this brain state. So I'm giving you an idea of my brain state by talking about a hypothetical world — it's not that I live in some other spooky world. And the same goes for feelings. If I say I feel like hitting you, what I'm doing is giving you an idea of what's going on in my head via what it would normally cause. In perception, it's the world that causes a perceptual state; with feelings, it's the internal state that causes an action, and I'm giving you an idea of my internal state by telling you what kind of action it would cause. Now, I might feel like hitting you, or anyone else, or kicking the cat, or whatever, in which case, instead of giving you any one of those actions, I just use a term like angry — but that's really an abbreviation for all those angry actions. So I'm giving you a way of seeing what's going on in my head by describing actions I might take, but they're just hypothetical actions, and that's what the word feel means. When I say I feel such-and-such, it's not that there's some special inner essence I'm feeling — the idea that computers are just transistors and so can't have feelings, that you have to have a soul or something to have feelings. I'm describing my internal state via the actions it would cause if I disinhibited them, and from the point of view of another human being, if you were a machine and said things like that, I would perceive you as having the right feelings. So let's take the case of perception; it's a little simpler, I think. Say we create a large, complicated neural network that can do perception and also produce language — we have those now, yeah — so you can show it something and it can give you a description of what's there. Now suppose we take one of those networks and say: I want you to imagine something. It imagines something and then tells you what it's imagining, and it says, I'm experiencing a pink elephant. It's experiencing the pink elephant in just the same sense a person is when they say they're experiencing one: it has an internal perceptual state that would normally be caused by a pink elephant, but in this case isn't caused by one, and that's why it uses the word experience. So there you have it — I think it has perceptual sensations in as much the same sense as we do. Although in the current state of the big language models they don't exhibit that kind of cohesive internal logic — you know, but they will. You think they will? Oh yeah, yeah. I don't think consciousness is the way people treat it — like the sound barrier, where you're either below the speed of sound or above it, your model either doesn't yet have consciousness or it does. It's not like that at all.
I'm not sure impress is the right word. They were interested. They were surprised, but what is your daily job? You have other responsibilities but you dedicate more time to them. conceptualizing and that could happen while you're walking or you're taking a shower or you're spending more time experimenting like in Matlab or you're spending more time running big experiments. Well, it varies a lot over time, so I often spend a lot of time. like when I wrote that article about glom, I spent a lot of time thinking about how to organize a perceptual system that was more neurally realistic and could deal with bumpy hierarchists without having to make dynamic configurations and connections, so I went through many months just thinking about how to do that and write an article about it.
I've spent a lot of time trying to think of more biologically plausible learning algorithms, yes, and then programming little systems in Matlab and figuring out why they don't work. The point is that most original ideas are wrong, and Matlab is very convenient for quickly showing they're wrong on very small toy problems, such as recognizing handwritten digits. I'm very familiar with that task; I can quickly test an idea to see if it works, and I probably have thousands of programs on my computer that didn't work. I'd program them in an afternoon, and an afternoon was enough to decide: OK, that's not going to work — that's probably not going to work; you never know for sure, because there may be some little trick you didn't think of. And then there will be periods where I think I've found something that works, and I'll spend several weeks programming and running things to see if it really does. Yes, I've been doing that recently with forward-forward. Let me say why I use Matlab. I learned many languages when I was young: POP-2, which was an Edinburgh language, UCSD Pascal, Common Lisp, Scheme, all kinds of Lisps, and then Matlab, which is ugly in some ways, but if you're dealing with vectors and matrices it's what you want — it makes things convenient. I became fluent in Matlab. I should have learned Python, and I should have learned all kinds of other things, but when you're older you're a lot slower at learning languages, and I'd already learned a lot of them, so I figured that since I speak Matlab fluently and can try out small ideas in it — and other people can then build them into their own big systems — I'd just stick with testing things in Matlab. There's a lot wrong with Matlab, but it's also very convenient. You talk a lot about learning in young children.
Is that knowledge base something you accumulated years ago, or do you continue to read and talk to people in different fields? I talk to a lot of people; I learn most things by talking to people. I'm not very good at reading — I read very slowly, and when I get to equations they slow me down a lot — so I've learned most of what I know by talking to people, and I'm lucky that there are a lot of good people to talk to. I talk to Terry Sejnowski and he tells me all kinds of neuroscience things.
I talk to Josh Tenenbaum and he tells me all kinds of things about cognitive science. I talk to James Howell and he tells me a lot about psychology. So I get most of my knowledge from just talking to people. Yann LeCun, whom you mentioned — yeah, he corrected my pronunciation of his name, LeCun — why did you reference him in that talk? Oh, because for many years he was pushing convolutional neural networks, and the vision community said, OK, they're fine for small things like handwritten digits, but they'll never work for real images. Famously, a paper was sent to a conference where he and his co-workers actually performed better than any other system on a particular benchmark — I think it was segmenting pedestrians, but I'm not really sure; it was something like that — and the paper was rejected even though it had the best results. One of the referees said the reason for rejecting it was that the system learned everything, so it didn't teach us anything about vision. That's a wonderful example of a paradigm. The paradigm for computer vision was: you study the task, you figure out the computation that needs to be done, you discover an algorithm that will do that computation, and then you figure out how to implement it efficiently — so the knowledge is all explicit, the knowledge used to do vision is explicit.
You have to work it out mathematically and then implement it, and it sits there in the program. They just assumed that's the way computer vision has to work, and because it has to work that way, if someone comes along and just learns everything, that's of no use to you, because they haven't said what the knowledge is, what the heuristics are. So OK, maybe it works, but that's just good luck; in the end ours will work better, because we're using real knowledge — shouldn't we understand what's going on? So they missed the main message, which was that it learned everything — well, not quite everything, because you're wiring in convolution. The machine learning community respected him, because he's obviously a smart guy, but they thought he was on the completely wrong path, and they discounted his work for years and years. Then, when Fei-Fei Li and her collaborators produced the ImageNet competition, we finally had a dataset large enough to show that neural networks would actually work well. Yann actually tried to get several different students to make a serious attempt at ImageNet with convolutional networks, but couldn't find a student who was interested in doing it, at the same time that Ilya was very interested in doing it, and I was interested in doing it, and Alex Krizhevsky was an excellent programmer who worked hard to get it to work really well. So it was very unfortunate for Yann that it wasn't his group that finally convinced the computer vision community that this actually works much better.
So what are you doing now? You've published this paper — are you hoping to start some sort of army of people trying it out? Will you publish the simple Matlab code? Yes, because there are a lot of little things you have to do, otherwise it won't work, so the code has to be there; it's more demanding than backpropagation. With backprop you just show people the equations and anyone can implement it, and you don't need a lot of tricks to make it work quite well — well, it needs some tricks, but it has worked pretty well. OK, so with forward-forward you need some tricks to make it work.
The tricks are pretty reasonable tricks, but once you put them in, it works, and I want to make Matlab code available so other people can make it work — but I didn't want to post my very primitive Matlab code, because it's disgusting. Thanks — that's all for this week's podcast. I want to thank Jeff for his time. I also want to thank ClearML for their support. We're looking for more sponsors, so if you're interested in supporting the podcast, please email me at craig at eye-on.ai — that's e-y-e, hyphen, o-n, dot a-i. As always, you can find a transcript of this episode on our website, eye-on.ai. I recommend you read the transcript if you really want to understand the forward-forward algorithm. In the meantime, remember: the Singularity may not be near, but AI is about to change your world, so pay attention.
