
What is ChatGPT doing...and why does it work?

Apr 07, 2024
Well, hello everyone. Usually at this time each week I do a science and technology Q&A for kids and others, which I've been doing for about three years, where I try to answer arbitrary questions about science and technology. Today I thought I'd do something slightly different. I just wrote a piece about ChatGPT: what does it really do, and why does it work? I thought I'd talk a little about that here, and then open things up for questions, and I'll be happy to try to talk about all kinds of things to do with ChatGPT, AI, large language models and so on, insofar as I know about them. So: a couple of months ago our friend ChatGPT burst onto the scene.
I have to say it was a surprise to me that it worked as well as it does. I've been following neural network technology for about 43 years, and there have been moments of significant improvement and long periods where it was an interesting idea but it wasn't clear where it was going to go. The fact that ChatGPT can work as well as it does, and can produce reasonable, human-like essays, is quite remarkable, quite unexpected, I think even unexpected for its creators. What I want to talk about is, first of all, how ChatGPT basically works, and second, why it works: why it's possible to do something that has always seemed a kind of pinnacle of human intellectual achievement, to write an essay describing something. Why is that possible?

I think what ChatGPT is showing us is some things about science and language and thinking that we might have suspected for a long time but didn't really know; it gives us a kind of scientific evidence for them. OK, so what is ChatGPT actually doing? Basically, the starting point is trying to write in a reasonable way: you take an initial chunk of text that you've been given, and you try to continue that chunk of text in a reasonably human way, in a way that's characteristic of typical human writing. So you give it a prompt, you say something, you ask something, and it's as if it's thinking to itself:
"I've read the whole web, I've read millions of books; how would text typically continue from this prompt I've been given?" It's asking for the reasonable, expected continuation, based on a kind of average of a few billion web pages, a few million books, and so on. That's what it's always trying to do: continue from the initial prompt it's been given, in a statistically sensible way. So, let me start sharing my screen here. Let's say the prompt you'd given it was "The best thing about AI is its ability to". ChatGPT then has to ask: what is it going to say next? And the thing to understand about ChatGPT, which is a bit shocking when you first hear it, is that those essays it writes, it's writing them one word at a time. As it writes each word, it doesn't have an overall plan about what's going to happen; it's just asking: what is the best word to write next, based on what I have already written?
It's notable that in the end one can get an essay that seems coherent and has a structure and so on, even though in a sense it's being written one word at a time. So say the prompt has been "The best thing about AI is its ability to". What's the strategy for what comes next? What it does is ask: what is the most likely next word, based on everything I've seen on the web and so on? And it computes certain probabilities: say, "learn" has a probability of 4.5%, "predict" 3.5%, and so on. Then it types whatever next word it decides it should type. One strategy you could adopt is: always type the word with the highest probability, based on what's been seen on the internet and so on. It turns out that that particular strategy of just taking the highest-probability word doesn't work very well. Nobody really knows why; one can have some guesses. But if you do it, you end up getting very flat essays, often repetitive, sometimes even repetitive word for word. It's typical of big engineering systems like this that a certain touch of voodoo is needed to make things work well, and part of that is saying: don't always take the highest-probability word; with a certain probability, take a word of lower rank. And there's a whole mechanism, a parameter usually called "temperature" by analogy with statistical physics, that controls this: you're shaking things up to a certain degree, and the higher the temperature, the more you shake things around rather than just doing the most obvious thing of taking the highest-probability word.
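To make the sampling idea concrete, here is a minimal sketch in Python. (The talk's own demos use Wolfram Language; this is just an illustrative stand-in, the word list and probabilities are invented, and real systems usually apply the temperature to log-probabilities before renormalizing, which is what the rescaling below does.)

```python
import numpy as np

def sample_next_word(words, probs, temperature=0.8, rng=None):
    """Pick a next word: as temperature -> 0 this approaches 'always take
    the top word'; higher temperature shakes the choice up more."""
    rng = rng or np.random.default_rng()
    p = np.exp(np.log(probs) / temperature)  # rescale log-probabilities by 1/T
    p /= p.sum()                             # renormalize
    return rng.choice(words, p=p)

# hypothetical top-of-the-list probabilities for
# "The best thing about AI is its ability to ..."
words = ["learn", "predict", "make", "understand", "do"]
probs = np.array([0.045, 0.035, 0.032, 0.031, 0.029])
probs /= probs.sum()                         # normalize the truncated list
print(sample_next_word(words, probs))
```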
So it turns out that a temperature parameter of 0.8 apparently seems to work best for producing things like essays. OK, so let's see what's involved. One thing that's good to do is to get a concrete view of what's going on, and we can actually start looking at what it's doing on our computer. I should say that what I'll talk about here is based on the piece I wrote, which just came out a couple of days ago, and every image in it is click-to-copy: if I click on any picture, I get the snippet of Wolfram Language code that generates it, which you can paste into a Wolfram Language notebook on the desktop or in the cloud and run. So let me start showing you how this actually works, by running at least an approximation to ChatGPT. OpenAI has produced a number of models in recent years, and ChatGPT is based on the GPT-3.5 model, I think. These models have become progressively larger, and progressively more impossible to run directly on a local computer. What I have here is a small GPT-2 model, something you can run on your own computer; it's part of the Wolfram Neural Net Repository, and you can just pick it up from there. This is the kind of neural network that's inside a simplified version of ChatGPT, and we'll talk more about what all its innards actually are later. For now, we can say: let's use that model and have it tell us the words with the top five probabilities, given the initial prompt "The best thing about AI is its ability to". So those are the top five words. I can also ask for 20 words here, sorted in decreasing order of probability, and this now shows us those words with their different probabilities.
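For reference, here is a rough Python equivalent of that demo, using the Hugging Face `transformers` GPT-2 model. (This is an assumption on my part, not the Wolfram Language code from the talk; note also that GPT-2 really works with tokens, fragments of words, a point touched on later.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The best thing about AI is its ability to",
                return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]      # scores for the next token
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, 5)                 # the five most likely next tokens
for p, i in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(i))!r}: {p:.3f}")
```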
Just for fun, we can also look further down the ranking, at, say, word number 50; down there we're looking at words that were considered less likely. What does it mean to be less likely? It means that, based essentially on ChatGPT's extrapolation from what it has seen in billions of documents on the web, these are the words with a certain, lower, probability of appearing next in that particular sentence. OK, now let's say we want to continue. Say the next word chosen after "The best thing about AI is its ability to" was "learn"; what word will be chosen after that? We can work it out the same way: fill in "learn" and ask for the top five probabilities for the word after it. The most likely next word turns out to be "from", and after "learn from" the most likely word is "experience". So let's write a piece of code that automates this, nestedly applying the function that just takes the most likely word, say ten times. This uses the GPT-2 model, and it asks for the most likely continuation of that piece of text. That's the case where you always choose the most likely word, and as I said before, this zero-temperature case very quickly ends up getting tangled in some loop; let me find the example of what it actually does in that case... yes, here we go. Not a particularly impressive essay, and it gets pretty tangled. If you don't always choose the most likely word, things work much better; here are some examples of what happens when you use the temperature to shake things up a little, not always choosing the word estimated to be most likely. It's worth realizing that there's a huge spectrum here, of different words occurring with progressively lower probabilities. It's a typical observation about language (Zipf's law) that the nth most common word has a probability of about 1/n, and that's roughly what you see for the word that will come next here, and it's also what you see for words in text in general. We can also ask what happens in the zero-temperature case for the real GPT-3 model. One feature of this: here's a link to the OpenAI API, which is in our package repository, and if you use that and call GPT-3 at temperature zero, then, because the most likely word is always chosen, the result will be the same every time; there's no randomness in it. What happens when you have a nonzero temperature, choosing words that aren't always the most likely ones, is that randomness is being added, and that randomness means you get a different essay every time. That's why, if you press the "regenerate" button, chances are you'll get a different essay each time: different random numbers are being chosen to decide which of the ranked words to use. So this is a typical example of an essay generated by GPT-3 at temperature 0.8. OK, so the next big question: we have these probabilities for words; where do those probabilities come from?
As I was saying, the probabilities are basically a reflection of what's on the web; that's what ChatGPT has learned from, and it's trying to mimic the statistics of what it has seen. So let's take some simpler examples of that. ChatGPT essentially deals with writing a word at a time (actually fragments of words, but for the simplest cases we can assume they're just words). To start understanding this, though, let's think about writing individual letters one at a time. The first question is: if we're just going to write letters one at a time, with what probability should we write each letter? How do we work that out? Well, let's pick a random text, say the Wikipedia article about cats, and count letters in it. You see that e is the winner, a is the runner-up, t comes next; that's what the statistics of different letters look like based on this particular sample of English. Let's try the Wikipedia article on dogs: we get something a bit different, with o appearing more often, probably because there's an o in the word "dog". But if we go ahead and use a really large sample of English, say a few million books, and ask what the probabilities for different letters are in that very large sample, we see what many people will know immediately: e is the most common letter, followed by t, a, etc.
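Here's a sketch of that counting in Python (the talk does it in Wolfram Language, and the file name below is a hypothetical saved text sample; any large chunk of English gives similar numbers):

```python
from collections import Counter

def letter_probabilities(text):
    """Relative frequencies of the letters a-z in a sample of text."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.most_common()}

sample = open("cats_article.txt").read()   # hypothetical saved Wikipedia article
for letter, p in list(letter_probabilities(sample).items())[:5]:
    print(letter, round(p, 3))             # expect e near the top, around 0.12
```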
OK, so those are our probabilities. Now let's say we start generating text according to them: just generating letters, each with the probability we got from its frequency of appearance in English. So here we've asked it to generate 500 letters with the correct individual-letter probabilities for English text. It's very bad English text, but the number of e's should be about 12 percent, the number of t's about 9 percent, and so on. We can make it a little more like English by also adding a certain probability of a space, and now we're generating quote-unquote English text with the correct probabilities for letters and spaces. We can make it a little more realistic still by insisting that the words have the correct distribution of lengths, and this is the text we get when the word lengths have the right distribution and the letters occur with the right probabilities, e being the most common and so on. Clearly, clearly not English; if ChatGPT were generating this, it would be a failure. But it is something that, at the individual-letter level, is statistically correct. If you asked, "can you tell this is not English just by looking at the probabilities of the different letters?", the answer is no: by that measure, this is English. And different languages have different characteristic signatures of letter frequencies. If we do the corresponding thing for, say, Spanish, we get slightly different frequencies: similar, but not exactly the same.
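As a minimal sketch of that generation process, here's a Python version, including the letter-pair (2-gram) variant that comes up next. (The corpus file name is hypothetical; the counts come from whatever sample text you use, which is assumed to contain every letter.)

```python
import random
from collections import Counter, defaultdict

def letter_stream(text):
    return [c for c in text.lower() if c.isalpha() or c == " "]

def generate_independent(text, length=300, rng=None):
    """Each character drawn independently, with its overall frequency
    (letters plus space) in the sample text."""
    rng = rng or random.Random(0)
    counts = Counter(letter_stream(text))
    chars, weights = zip(*counts.items())
    return "".join(rng.choices(chars, weights=weights, k=length))

def generate_pairs(text, seed="t", length=300, rng=None):
    """Each character drawn conditioned on the previous one, i.e. with
    the correct letter-pair statistics."""
    rng = rng or random.Random(0)
    nxt = defaultdict(Counter)
    stream = letter_stream(text)
    for a, b in zip(stream, stream[1:]):
        nxt[a][b] += 1
    out = [seed]
    for _ in range(length):
        chars, weights = zip(*nxt[out[-1]].items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

text = open("big_english_sample.txt").read()   # hypothetical corpus file
print(generate_independent(text))   # word-salad with English letter frequencies
print(generate_pairs(text))         # already looks a bit more like English
```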
So that's generating English text with the correct single-letter statistics. We can plot the probabilities for the individual letters: e is the most common, q is very rare, and so on. In this case we're assuming each letter is chosen independently at random, but in real English we know that's not the case: if a q has been produced, then with overwhelming probability the next letter will be a u, and similarly for other combinations, other 2-grams, other pairs of letters. So instead of asking for the probability of an individual letter, we can ask for the probabilities of pairs of letters. Given that the letter b appeared, the probability that the next letter is e is quite high, while the probability that it's f is very low; and where there's a q, the probability of the following letter is only substantial when that letter is a u. That's what the probabilities look like for pairs of letters. Now let's generate text a letter at a time using not just the individual letter probabilities but the letter-pair probabilities as well. When we do that, it starts to look a little more like real English text: there are a couple of real words here, like "on" and "thee", and well, "Tesla", I guess that's some kind of word. It's getting a little closer to actual English because it's capturing more of the statistics of English. And we can keep going: instead of just having the correct probabilities for individual letters and pairs, we can have the correct probabilities for letter triples, four-letter combinations and so on (these numbers may be off by one: letters on their own, then pairs, and so on). By the time you're following the probabilities for six-tuples of letters, you get complete English words as a matter of course. And the fact that it works out that way is why autocomplete, when you're typing on a phone or something, can work as well as it does: once you have "a v e r", there's really only a limited number of words that can follow, so the word is pretty much determined. That's how probabilities work when you deal with blocks of letters instead of small numbers of letters. OK, so that's the idea: you capture the statistics of letters and sequences of letters, and use that to randomly generate text-like things. We can do the same thing not with individual letters but with words. In English there are maybe 40 or 50,000 fairly commonly used words, and based on a large sample, a few million books or whatever, we can ask what the probabilities of those different words are (the probabilities of different words have changed over time, and so on, but say we average over all the books, or take the current moment). Now we just start generating sentences, choosing words at random but with the probabilities corresponding to the frequencies with which they occur in these samples of English text. Here's a sentence we got with that method: the words appear with the correct probabilities, but the sentence doesn't mean anything; it's just a collection of random words. Next we can do the same thing we did with letters: instead of using a certain probability for each individual word, we compute the probabilities for pairs of words from a sample of English text. It's actually somewhat difficult computationally to do this even for pairs of words, because we're dealing with 50,000-squared possibilities. But let's say we start with a particular word; say we start with the word "cat" as our prompt.
These, then, are sentences generated with the correct probabilities for pairs of words. Overall it's a little strange, but each consecutive pair, like "confirmation procedure", is a pair of words that appears together somewhere in the text this was sampled from. So this is what you get when you're sampling text by word-pair statistics. It's a very pre-ChatGPT kind of thing, a super-minimalist version: just word-pair statistics instead of the much more elaborate things ChatGPT is actually doing. So you might say: how about doing something more like what ChatGPT does? Instead of choosing pairs of words, let's choose combinations of five words, 20 words, 200 words; given the prompt we've been given, let's add the next 200 words with probabilities matching what you'd expect based on what's on the web. Maybe we just make a table of the probability of every three-word combination, every five-word combination, and so on. Here's the problem with that: there simply isn't enough text written in English, or any other language, to estimate those probabilities in this direct way. With maybe 40,000 common words in English, the number of word pairs whose probabilities you'd have to ask about is already 40,000 squared, about 1.6 billion, and the number of triples is 40,000 cubed, about 60 trillion. Very quickly you run out: there isn't enough written text on the few billion web pages that exist to sample those 60 trillion word triples and say what the probability of each of them is. And by the time you're at a 20-word essay, you're dealing with a number of possibilities greater than the number of particles in the universe: you wouldn't even be able to write those probabilities down, even if you had text produced by some kind of infinite collection of monkeys imitating humans.
So how does ChatGPT deal with this, with the fact that you can't get enough text from the web to make a table of all those probabilities? Well, the key idea, and it's a very old idea in the history of science, is to make a model. What is a model? A model is something that summarizes data, that summarizes things in such a way that you don't have to have all the data: the model lets you predict more data even where you never measured it. A quintessential, very early example of modeling was Galileo in the late 1500s, trying to figure out things about objects falling under gravity: going up the Tower of Pisa, dropping cannonballs from different levels, and recording how long it takes for them to reach the ground, so that he can make a plot.
My goodness, that's a remarkably convoluted way to make this plot; alright. I don't know how many floors there actually are in the Tower of Pisa, but imagine there were this many. You can make a plot where you measure (in those days, perhaps by taking your pulse) how long it took the cannonball to hit the ground, as a function of the floor it was dropped from. So there's data: specific times for specific floors. But what if you want to know how long it would take the cannonball to hit the ground from the 35th floor, which was never explicitly measured? This is where the idea of making a model comes in, and here's a typical thing you might do.
You say: let's assume it's a straight line, that the time to hit the ground is a linear function of the floor, and this is the best straight line we can fit through the data. That line lets us predict the time to hit the ground from a floor we never explicitly visited. Basically, the model is a way to summarize the data, and to summarize what we expect if we go beyond the data. The reason this is relevant for us is that, as I mentioned, there isn't enough data to know the probabilities for all those word sequences just from the actual text that exists; you have to have a model, something that says "assume this is how things work in general", which gives an answer where no explicit measurement was made. And of course we can pick different models and get different results. For example, here's another model we could choose: a quadratic curve through these particular data points. It's worth realizing that there is no model-less model; every model makes certain assumptions about how things work. In the case of physics problems like dropping balls from towers, we have a pretty good expectation that simple mathematical models, mathematical formulas, are likely to work. It doesn't always go that way: here's yet another mathematical function, with some parameters, shown as the best version of that model fit to this data, and you can see it's a thoroughly poor fit. If we assumed that this was the general way things work, we wouldn't be able to correctly reproduce what the data says.
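As a sketch of that model-fitting idea in Python (the floor numbers and fall times below are invented for illustration; physically the time grows roughly like the square root of the height, which is part of why a straight line is only an approximation):

```python
import numpy as np

floors = np.array([2, 4, 6, 8, 10])            # floors dropped from (made up)
times = np.array([0.9, 1.3, 1.55, 1.8, 2.0])   # seconds to hit the ground (made up)

line = np.polynomial.Polynomial.fit(floors, times, deg=1)  # straight-line model
quad = np.polynomial.Polynomial.fit(floors, times, deg=2)  # quadratic model

# predict the time for a floor that was never measured
print(line(7), quad(7))
```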
In the case of that last model, the poorly fitting one, I think it has three parameters that it's trying to fit to the data, and it isn't doing well. In the job ChatGPT is doing, there are 175 billion parameters being adjusted to make a model of human language, and the hope is that the underlying structure being used is such that, with its 175 billion parameters, it can estimate the probabilities of things in human language more or less correctly. OK. For things like dropping balls from the Tower of Pisa, we've learned over the last 300 years, from Galileo on, that there are simple mathematical formulas that govern those kinds of physical processes in nature. But for a task like "what is the most likely next word", or other human-like tasks, we don't have a simple mathematical-style model. For example, here's a typical human task: recognize, from an array of pixels, which of the ten possible digits an image shows. Is this a 4? Is this a 2? We humans do a good job of saying "that's a four, that's a two" and so on, but we need to ask how to think about this problem. One thing we could try is the approach of just collecting data and reading off the answer: get a complete collection of fours, and when we're presented with a particular array of pixel values, ask whether that array exactly matches one of the fours in our sample.
The chance of an exact match is incredibly small, and it's clear that humans do better than that: it doesn't matter exactly where the individual pixels fell; as long as it's roughly the shape of a four, we're going to recognize it as a four. So the question is, how does that work? Well, here's the standard machine-learning treatment of this problem: this is using a simple neural network to recognize these handwritten digits, and we see that it gets the correct answer there. But suppose we ask what it's actually doing.
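To make the setup concrete, here's a minimal version of the digit task using scikit-learn's small built-in digit images, with a simple neural network classifier. (This is my sketch, not the talk's network; exact accuracies will vary.)

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)        # 1797 little 8x8 grayscale digits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)                  # learn from labeled examples

print(net.score(X_test, y_test))           # accuracy on digits it has never seen
```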
Let's say we give it a progressively blurrier sequence of digits. At the beginning it gets them right, and then at some point it, quote, gets them wrong. What does it mean that it gets them wrong? We know the thing we put in was a two, and we know we kept blurring it, so we can say: it got it wrong, because we knew it was supposed to be a two. But if we zoom out and ask what's happening at a broader level, the question is: if we were humans looking at those images, would we conclude that it's a two or not? Once it gets blurry enough, we humans wouldn't know it's a two either. So to evaluate whether the machine is doing the right thing, what we're really asking is whether it's doing something more or less like what we humans do. That's the question we have to ask for these kinds of human-like tasks; the answer is defined by humans. We might say: most humans would still recognize that one as a two. If instead we had visual systems like bees or octopuses, we might come to completely different conclusions once things get blurry, and the question of whether we consider it to be a two could come out quite differently.
It's a very human judgment to say that something still looks like a two; it depends, for example, on our visual system. It's not something with a mathematically precise definition of what has to count as a two. OK, so the question is: how do these models that we use for things like image recognition actually work? By far the most popular and most successful approach today is to use neural networks. And what is a neural network? It's a kind of idealization of what we think is happening in the brain.
So what is happening in the brain? We each have about 100 billion neurons in our brains: nerve cells with the characteristic that, when excited, they produce electrical signals, perhaps up to a thousand times per second. Each nerve cell has cable-like projections connecting it to maybe a thousand, maybe 10,000 other nerve cells. In rough approximation, electrical activity in one nerve cell communicates with other nerve cells, and the whole network of nerve cells has an elaborate pattern of electrical activity. Roughly the way it seems to work is that the degree to which one nerve cell affects others is determined by a kind of weight associated with each connection. One connection might have a strong positive effect: if the first nerve cell activates, it's very likely that the next one activates. Another might be an inhibitory connection: if one nerve cell activates, it's very unlikely that the next one will. There's a whole combination of these weights associated with the different connections between nerve cells. So what's actually happening when we try to recognize a two in an image? The light, the photons from the image, fall on cells at the back of our eye, in the retina, where photoreceptor cells convert the light into electrical signals; those signals pass through nerves to the visual cortex at the back of our head. There's a collection of nerves corresponding essentially to the different positions of pixels in the image, and then, inside our brain, there's a sequence of layers of neurons processing the incoming electrical signals, until eventually we form the thought that the image in front of us is a two, and we might say "it's a two". That process of forming the thought is what we're calling the recognition process.
That was about the actual neural networks we have in the brain, but what's used in all of these models, including things like ChatGPT, is an idealization of that neural network. For example, here is a symbolic representation of the particular network we were using for image recognition; we'll talk about what all these pieces are, not in full detail, but we will talk about them. It's a kind of biology-flavored engineering: lots of different little pieces that fit together to achieve the result of recognizing digits and so on. This particular neural network was built in 1998, and it really was done as a piece of engineering. So how do we think about the way such a neural network works? Essentially, the key idea is the idea of attractors, an idea that originally came out of mathematical physics but is central to thinking about neural networks and things like them. What's the idea? Say we've got all these different handwritten digits, ones and twos and so on. What we want is that if what we have is close to the ones, we're attracted to the "one" point, and if it's close to the twos, we're attracted to the "two" point. The idea of attractors is to imagine you have, I don't know, a mountainous landscape or something, and you're a drop of water that falls somewhere on the mountain: you're going to roll down the mountain until you reach the minimum for your particular part of the mountain. But there are watersheds, and a raindrop that falls somewhere else will roll down into a different minimum, a different lake. It's the same kind of thing here: when you get far enough away from the things that look like ones, you roll toward the attractor for two instead of the attractor for one. That's the idea.
Let's do an idealized version of this. Say I have a bunch of points in the plane, say they're coffee shops, and I say I'm always going to go to the coffee shop closest to me. This so-called Voronoi diagram shows you the watershed division between the coffee shops: if you're on this side of a basin you go to this coffee shop, if you're on that side you go to that one. That's a kind of minimal version of this idea of attractors.
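Here's a minimal sketch of that nearest-attractor rule in Python (the coffee-shop coordinates are invented):

```python
import numpy as np

shops = np.array([[0.0, 0.0],   # hypothetical coffee-shop locations
                  [3.0, 1.0],
                  [1.0, 4.0]])

def nearest(point):
    """Index of the basin the point falls in: just the closest shop."""
    d = np.linalg.norm(shops - np.asarray(point), axis=1)
    return int(np.argmin(d))

print(nearest([2.0, 2.0]))      # which cell of the Voronoi diagram we're in
```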
OK, so let's connect neural networks with attractors. Take an even simpler version: three attractors, a zero attractor, a plus-one attractor and a minus-one attractor. If we land at some x, y coordinates in this region, we want the result to be zero: we're in the basin of the zero attractor, and we want to produce a zero. So based on the x, y position we can say which output we want to get: on this side a one, over there a minus one, there a zero. That's what we're trying to set up, something with this kind of behavior. OK, now let's bring in a neural network; this is a typical little one. Each of these points represents an artificial neuron, each of these lines represents a connection between neurons, and the color from blue to red represents the weight associated with that connection, blue being most negative, red most positive; this shows the network with particular choices for those weights by which one neuron affects others. So how do we use this neural net? We put inputs in at the top: say those top two neurons get the values 0.5 and minus 0.8, which we interpret as being at position x = 0.5, y = -0.8 in the diagram we drew. Now this neural network is basically just computing a certain function of these x and y values. At each step, what a given neuron does is take the weight on each incoming connection multiplied by the value of the corresponding neuron above, and add those products up; then it adds a constant offset, a different offset for each neuron. And then there's the slightly strange thing one does, inspired by what seems to happen biologically: we apply a kind of threshold function. It's very common, for example, to use ReLU: if the total is less than zero, the result is just zero; if it's greater than zero, the result is the actual value. There are a variety of these so-called activation functions; they determine what the activity of the next neuron in line will be, based on the input to that neuron. So again: at each step we collect the values of the neurons in the previous layer, multiply by the weights, add the offset, and apply that ReLU activation function, getting this value, minus 3.8 in this case.
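Here's that computation as a small Python sketch: each layer is weights times values, plus offsets, then ReLU. (The weights below are random placeholders, not the ones in the talk's picture, and a real network for this task would typically leave the activation off the final layer.)

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)      # below zero -> zero; above zero -> unchanged

def forward(x, layers):
    """One pass through the network: for each layer, multiply by the
    weights, add the offsets, apply the activation function."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),   # 2 inputs -> 4 neurons
          (rng.normal(size=(3, 4)), rng.normal(size=3)),   # 4 -> 3
          (rng.normal(size=(1, 3)), rng.normal(size=1))]   # 3 -> 1 output

print(forward(np.array([0.5, -0.8]), layers))
```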
So here's what happens: we start with these values, 0.5 and minus 0.8, we go through the entire neural network, and at the end it comes out with the value minus one. OK, so what does that neural network do as we change those inputs? We can plot what it actually computes as a function. Remember the goal: for a value in this region we want a zero, here a minus one, and so on. This is what that particular neural network manages to do: it didn't achieve the exact values zero, one, minus one, but it gets fairly close. So this is a neural network that has been configured to be as close as possible, for one of that size and shape, to the exact function we wanted to compute.
So how do we think about what this neural network is doing? The neural network is just computing some mathematical function. For the particular network I was showing, if the w's are the weights, the b's are the offsets and f is the activation function, each neuron computes f(w·x + b), and nesting those gives a somewhat confusing algebraic formula for the value of the output in terms of x and y, the values of the inputs. So now the question is, looking at simple neural networks, what kinds of functions can they actually compute? At the most minimal level, here is a single neuron receiving input from two other neurons. What function is it computing? Well, it depends on the weights: these are the functions computed for various choices of weights, very simple functions in all cases, just these ramps. Now let's use a slightly more sophisticated, though still very small, neural network; this is the best it can do at reproducing the function we want.
Answer well again, it's a difficult question that's really a human level question because the question of whether you put one in the wrong place, so to speak, is a question of how we would define that well, we can do similar types of things, let's say . we have other types of images that we could try and create a neuron that distinguishes cats from dogs and here we are showing how it distinguishes those things and mainly the cats are in this corner the dogs are in this corner um but you know the question of what should you really ultimately do, uh, you know what should you do if we put a dog in a cat suit, should you say it's a cat or should you say it's a dog, um, are you going to say something definitive, the question is .
Is it in any way in line with what we humans would evaluate? Well, you know, one question you might ask is: what does this neural network do inside when it turns out to be some kind of Katniss or her canine character? And let's say we start with um. let's do this and we can actually do the same thing uh let's say we start with an image um well maybe you know let's say we start with an image of a cat here now we can um uh we can tell what's going on inside the neural network when it decides that in It is actually an image of a cat.
Well, what can we normally do when we look inside a neural network? It is very difficult to know what happens in the case that the neural network corresponds to an image that we can see. at least neural networks tend to be set up in a way that preserves the pixel structure of the image, so for example, here we can go, this is just go, what is this, this goes, um, uh, 10 layers down, no, not this. I'm really sorry about this, this actually goes just one layer down in the neural network and what happens in this particular neuron is it takes that image of a cat and splits it into many different types of variants of that image now in En this level we can say well, it's doing things that we can recognize, it's like looking at the outlines of the cat without the background, it's trying to get the cat out of the background, it's doing things that we can imagine, eh, you know. describe in words what is happening and, in fact, many of the things it is doing are things that we know from studying the neurophysiology of the brain and that are what the first levels of visual processing and the brain actually do when we are more deep in the neural network um, it's much harder to know what's going on, let's say we go down 10 10 layers in the neural network um so, uh, we have again something like in the mind of the neuron this is what it's in thinking Try to decide if it's a cat or a dog.
Things have become much more abstract, much more difficult to recognize explicitly, but that's kind of a representation for us of what's happening in the neural network kind of Mind. You know, if we say well, what is a theory about how cat recognition works? cat, you know, we can't um uh and if you even ask a kind of human how you know, we say well, it's got these pointy ears, it's got this and that, um, it's probably hard for a human to describe how they do it. recognition and when we look inside the neural network we can't have a kind of uh there's no guarantee that there's a kind of simple narrative of what it's doing and it's usually not right so we've talked about how it works the neural network.
Networks can successfully go from a cat image to saying "it's a cat", "this is it, it's a dog", how do you configure the neural network to do that? The way we normally write programs is to say well, I'm thinking about how this program should work. What should you do if you take the cat image first? Has? Do you know what the shape of their ears is? Do you have mustaches? All this kind of stuff. That's the typical engineering way of making a program. that's what people did 15 years ago, 20 years ago, by trying to get it to recognize images of things, which was the typical kind of approach which was to try to recognize explainable human characteristics from images, etc., as a way kind of recognizing things, the big idea of ​​machine learning is that you don't have to do that, instead what you can do is just give a bunch of examples where you say this is a cat, this is a dog and give it the case that I have a system that can learn from those examples and where you just have to give it enough examples and then when you show it a new cat image that has never been seen before, it will correctly say that it is a cat versus a dog, so let's talk about how that's actually done um and uh what we're interested in is if we can take one of those neural networks.
I showed that neural networks have all these weights and as you change the weights, you change the function of the neural network. Computing let's say you have a neural network and you want it to compute a particular function, so let's say let's take a very simple case, let's say we have a neural network, we just want it to compute as a function of X, we wanted to compute this. particular function here, okay, so let's choose a neural network, there is a neural network without weights, let's populate random weights in that neural network for each random collection of weights in the neural networks, the neural network will calculate something, it won't be the function. we want to, but the law is to calculate something, it will always be the case that when you enter some value up here you will get some value down here and these are graphs of the function that you get by doing that, okay, the big one The idea is that if you do the right way and you can give enough examples of um uh um of um uh what function you're trying to learn of, you'll be able to progressively adjust the weights in this neural network so that eventually we'll get a neural network that correctly computes this function, so again I'll do it.
What we're doing here is we're just describing if this is X, let's say you know G of X down here, this is the value of x up here and this is a g of type of square wave here now, in this particular case, this neural network with these weights is not calculating the function that we wanted, it is calculating this function here, but As we progressively train this neural network, we adjust the weights until we finally get a neural network that actually calculates the function we want. In this particular case, it took 10 million examples to get to the point where we have the neural network we want.
Okay, so um, how does this actually work? How is this actually done? How do you do this? As I said at the beginning, we just started with neural networks where we had random weights with random weights, this function Even if we have examples of functions, examples of results, how do we go from them to training the neural rats? Basically, what we are doing is us. run, we say we have this neural network, uh, we say, let's choose a value of x 0.2, for example, let's pass it through the neural network, see what value we get, okay, we get this value here, oh, we say that the value is not correct. in what we were trying to do based on the training data that we have based on this function that we are trying to train the neural networks to generate that training, it is not the correct result, it should have been. let's say minus one and in fact it was 0.7 or something like that, so the idea is that knowing that we were wrong we can measure how much we were wrong and we can do that for many different samples that we can take, let's say a thousand examples of this mapping from the X value to the G function of For example, taking the sum of the squares of the differences between those values, um, and that gives us an idea that if all the values ​​were correct, that would be zero, but it's actually not. zero because we didn't actually get the right answer without knowing it and so what we're trying to do is progressively reduce that loss, we're trying to progressively adjust the neural network to reduce that loss, for example this is what would normally look like what normally you have, this is the loss based on the number of examples you have shown and what you see is that as you show more and more examples, the loss progressively decreases, reflecting the fact that the function that is being calculated by the neural network is getting close to the function we really wanted and finally the loss is quite small here and then the function actually calculated by the neural network is very close to the one we wanted, that's the idea. of training a neural network, we are trying to adjust the weights to reduce the loss and get to where we want, okay, so let's say we have a particular neural form of Weights, we calculate the loss, the loss is really bad, it's us.
We are quite far away, how can we gradually get closer to the correct answer? Well, we have to adjust the weights, but in which direction do we adjust the weights? Well, this is a complicated thing that, um, was solved well in the 1980s, so later we explain how to do this in a reasonable way, we knew how to do it in simple cases. I should say the idea of ​​neural networks originated in 1943 um uh Warren McCulloch and Walter Pitts were the two guys who wrote this guy. from the original paper that described these idealized neural networks and what's inside the GPT chat is basically a big version of what was described in 1943 and there was kind of a long history of people doing things with a single layer of neural networks and that it didn't work. very well and then in the early 1980s there started to be some knowledge about how to deal with more layers of neural networks and then when GPUs started to exist and computers became faster there was a big breakthrough around from 2012, where it became possible to deal with uh. a kind of training and use of a kind of deep neural networks.
By the way, for people interested, I had a conversation with a friend of mine named Tara Sanovsky, who has been involved with neural networks for about 45 years and has been instrumental in many of the many developments that have happened, I had a discussion with him which was broadcast live a few days ago and which you can find on the web, etc., if you're interested in that story, but back to back to sort. of how these things work what we want to do is we find out that the loss is bad let's reduce the loss how do we reduce the loss we need to adjust the weights in which direction do we adjust the weights to reduce the loss well this turns out turns out to be a great calculus application because basically What's happening is that our neural network corresponds to a function that it has, it's a function of the weights, it's a function of once we calculate the loss, we're basically calculating the value of this neural network. works for many values ​​of do we reduce the overall value?
How do we adjust the weights to reduce this amount of overall loss? Well, we can use calculus. We can say that we can think of this as a kind of surface as a function. of all these weights and we can say that we want to minimize this function as a function of the weights, so for example, we could have a in a very simplified case, well, this is not good. In a very simplified case, we might have um ah. based on just two weights, for example, in those neural networks that I was showing they had, I don't know, 15 weights or something like that.
In the real example of an image recognition network, it could be 40000 pesos in Chachi DT, it is 175. billion pesos, but here we are only looking at two pesos and we are wondering if this was the loss based on the value of those pesos, how would we find the minimum, how would we find the minimum? find the best values ​​of those weights, look, here we go, so this is a typical procedure for using the so-called gradient descent, basically what you do is say: I am in this position on this lost surface, lost surface where the coordinates of the surface are weights what I want to do is get to a lowest point on this lost surface and I want to do it by changing the weights always following this gradient Vector type of hill down the steepest descent down the hill and that's something you just have to use calculus and just calculate the derivatives at this point based on these weights and the direction in which you are finding the maximum of these derivatives, you are going down the hill as much as you can, okay, this is how you try to minimize the loss by adjusting the weights so that you continue this gradient descent to reach the minimum.
Now there is a small error with this because the surface that corresponds to all the weights you may have, as this image shows, may have more than one minimum and in reality these minima may not all be at the same height, for which, for example, if you are on a mountain, there may be a lake on the mountain. it will be a very high altitude mountain lake and all the water following the filters is sent downwards to reach the minimum, it only manages to reach that high altitude mountain lake even though there is a low altitude mountain lake which will have a much lower value. of the loss, so to speak, that you don't reach with this gradient descent method, it's never you, you get stuck in a local minimum, you never reach the more global minimum and that's what potentially happens in neural networks.
May be fine. I'll cut the loss. I'm going to adjust the weights, but wow, I really can't get very far. I can't reduce the loss enough to be able to successfully reproduce my function with my neural network or whatever it is, I can't adjust the weights enough because I got stuck in a local minimum. I don't know how to get out of that local minimum, so this was a big breakthrough in Surprise 2012 in the development of neural networks. was the next discovery, you might have thought that you would have the best chance of getting a neural network to work well when it was a simple mirror that you put your arms around and calculated all these weights and did all these calculations and so on, but actually It turns out that things become easier when the nerves become net and the problem you are trying to solve becomes more complicated and generally speaking the intuition seems to be this, although one did not expect this, I think no one expected this.
I certainly don't. I didn't expect this, it's kind of obvious after the fact, okay, the problem is that you will get stuck while trying to follow this gradient descent well, if you are in some kind of low dimensional space, it's pretty easy. to get stuck you just walk into one of these mountain lakes, you can't go any further but in high dimensional space there are many different directions you can go and chances are any local minimum you reach you can escape from that local minimum because there will always be some Dimension, some direction that you can go that allows you to escape and that's what seems to be happening, it's not totally obvious that it would work that way, but that's what seems to be happening in these Neural networks, there's always a kind of when you have a complicated enough neural network, there's always a way to escape, there's always a way to reduce the loss, etc., okay, that's kind of um uh, this idea of modify the weights. to reduce the loss, that's what's happening in all neural networks and you can, um, uh, there are different schemes, you know how gradient descent is done and how big the steps are, and there's all kinds of things different, there are different ways. you can calculate the loss when we do it for language, we are calculating word probabilities based or word sequence probabilities based on the model versus what we actually see in the data instead of just distances between numbers etc. but it's the same basic idea, okay, so when that happens, let's see, we can potentially get it, every time we run one of these neural networks, we do all these adjustments of weights and so on, we get something where yes, we have an illness.
in reproducing what we want. Okay, so these are the results of four different neural networks successfully reproducing this function. Now you might wonder what happens if I go outside the range where I explicitly trained the neural network: I told it my function on a certain range; now let me try to run it with a value outside that range. The neural network is trying to figure out things it was not explicitly trained on, and it will give completely different answers depending on the details of how the neural network was trained.
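You can see this divergence with a small experiment. This sketch (using scikit-learn's MLPRegressor, which is not what the talk used, but convenient here) trains two small networks on the same function over the same range, then asks both about a point outside it:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(-2, 2, 200).reshape(-1, 1)
y = np.sin(2 * X).ravel()                    # the function to be learned

for seed in (0, 1):
    net = MLPRegressor(hidden_layer_sizes=(30, 30), max_iter=5000,
                       random_state=seed)
    net.fit(X, y)
    # Inside the training range the two nets agree closely;
    # outside it their answers diverge, depending on training details
    print(seed, net.predict([[1.0]]), net.predict([[4.0]]))
```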
It's like it knows the things it's already seen: it basically reproduces those examples, and when it comes to things outside that range, it can, so to speak, think differently outside that range, depending on the details of that neural network. Well, okay, this whole question of training neural networks is a giant modern art, so to speak, how to train a neural network, and in the last decade particularly there's been a kind of increasingly elaborate knowledge of that art of training neural networks, a certain amount of lore about how these neural networks should be trained; that's what's developed. So how does that work, what's in that lore?
Well, the first question is, you know, what type of neural network architecture: how many neurons, how many neurons in each layer, how should they connect to each other, what should you use? And there have been a number of observations in this kind of art of neural networks that have emerged. What was believed at first was that each different task you want a neural network to perform would need a different architecture; you would somehow optimize the architecture for each task. It turned out that that's not the case.
Much more the case is that there are generic neural network architectures that seem to cover many different tasks. And you might say, isn't that the same as computers, universal computers, where you just need to be able to run different software on the same computer, same hardware, different software? That was the idea from the 1930s that launched the whole computer revolution, the whole notion of software and so on. Is this a repeat of that? Actually, I don't think so. I think this is really something slightly different. I think the reason that neural networks with a small number of architectures cover many of the tasks that neural nets can perform is that the tasks neural networks can do are tasks humans are also pretty good at, and these neural networks are replicating something about the way humans do tasks. So as long as the tasks you ask the neural network to do are human-like tasks, any human-like neural net can do them. Now, there are other tasks, different kinds of computations, that neural networks and humans are both pretty bad at, and those will be outside this zone; it doesn't really matter what architecture you have. Well, okay, so there were all kinds of other things people were wondering about, like saying, well, instead of making these very simple neurons like the ones from 1943, let's make more complicated assemblies of things and put more detail
into the internal operations of the neural network. It turns out that most of those things don't seem to matter, and I think that's not surprising from a lot of the science I've done, not specifically related to neural networks. Now, when it comes to neural networks and how they're designed, there are some features of the data you're feeding the neural network that seem useful to capture in the actual architecture of the neural network itself. Probably in the end it's not completely necessary; you could probably use a much more generic neural network and, with enough training, enough tuning from real data, it could learn all this. But, for example, if you have a neural network that deals with images, it's useful to initially organize the neurons in an array, like the pixels. So this is a representation of the particular network, called LeNet, that we were showing for image recognition, for digit recognition; this is a representation where there's a first layer of neurons here that fans out into multiple different copies of the image that we actually saw when looking at those images, and then it goes on and finally rearranges what's there.
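As a toy illustration of why that pixel layout matters, here's the core operation of a convolutional layer sketched in Python: the same small kernel of weights is applied at every position of the image array, so neighboring pixels are always processed together (the kernel and image here are stand-ins):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image, taking a weighted sum at each position
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)                # stand-in for an 8x8 pixel array
kernel = np.array([[1.0, -1.0]])            # responds to horizontal changes
print(convolve2d(image, kernel).shape)      # an (8, 7) "feature map"
```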
What to understand about neural networks is that neural networks take everything they're dealing with and grind it into numbers. Computers take everything they're dealing with and eventually grind it into zeros and ones, into bits; neural networks grind things into arbitrary numbers, you know, 3.72; they're real numbers, not necessarily just zeros and ones. It's not clear how important that is; it's necessary when you go to gradually improve the weights and use calculus-like things, you need these continuous numbers to be able to do that. But in any case, whether you're showing the neural network an image, a piece of text, whatever it is, in the end it has to be represented in terms of numbers. And how those numbers are organized matters: for example, here there's an array of numbers arranged in pixel positions, and in general that array gets reconstituted, rearranged, flattened and so on, and in the end you get probabilities for each of the ten digits, which will just be a sequence of numbers here, a rearranged collection of numbers.
Okay, so let's look at the image on the right, there we go. Okay, so this is about something like: how complicated a neural network do you need to achieve a particular task? Sometimes that's quite hard to estimate, because you don't really know how difficult the task is. But let's say you want a neural network that plays a game well. You could compute the entire game tree for the game, all the possible sequences of play that could occur; it could be a huge game tree. But if you want to achieve a human level of play for that game, you don't need to reproduce the entire game tree.
If you were doing a very systematic computer calculation and just playing the game by looking at all the possibilities, you would need the entire game tree, or you'd need to be able to traverse that entire game tree. But in the case where you're trying to achieve some sort of human-like performance, humans may have found some heuristic that simplifies things dramatically, and you may find that you need just a much, much simpler neural network. So this is an example where, if the neural network is too simple, it doesn't have the capability to reproduce, in this case, the function we wanted; but you'll see,
as the neural network gets a little more complicated, we eventually get to the point where we can reproduce the function we wanted. Well then, you can ask, you know, are there theorems about which functions you can reproduce with which neural networks? Basically, as soon as you have neurons in the middle, at least in principle you can reproduce any function, but you may need an extremely large number of neurons to do it. And it's also the case that that neural network might not be trainable: it may not be that you can find, for example, a gradient that always makes the loss decrease and so on just by adjusting the weights; it could be that you can't incrementally arrive at that result. Well, okay then, oops, let's say
you have decided on the architecture of your neural network and now you want to train it. The next big step is that you need the data to train your neural network, and there are two basic categories of training that are done for neural networks: supervised learning and unsupervised learning. In supervised learning, you give the neural network a bunch of examples of what you want it to learn. So you might say: here are 10,000 images of cats, 10,000 images of dogs; the images of cats are all labeled "this is an image of a cat", the dogs "this is an image of a dog", and you're feeding the neural network these explicit examples of the thing you want it to learn. That's what one has to do for many forms of machine learning.
It may not be trivial to obtain the data. Often there are data sources where you're leveraging something else. Like, you can get images from the web, and they might have alt tags, text that described the image, and that's how you can associate the description of the image, the fact that it's a cat, with the actual image. Or, if we're doing audio kinds of things, you might say: let's get a bunch of videos that have subtitles, and that gives us the supervised information; here's the audio, here's the text that corresponded to that audio, that's what we have to learn. So that's one style of teaching neural networks, supervised learning, where you have data that's explicitly examples of "here's the input, here's the output you're supposed to give". And that's great when you can get it; sometimes it's very, very difficult to get the data necessary to train the machine learning system, and when people say, oh, can you use machine learning for this task, and there's no training data, the answer will probably be no, unless that task is something where you can get some sort of proxy for it from somewhere else, or you just have to blindly hope that something transferred from some other domain might work, just like when you're doing mathematical models you might say, well, linear models or something worked in these places, maybe we can blindly hope they work here; it doesn't tend to work like that. Well, okay, the other form of learning... actually, no,
I should explain another thing about neural networks first, which is kind of important, something that's been very critical over the last decade or so: the notion of transfer learning. Once you've learned a certain amount with one neural network, being able to transfer that learning to a new neural network, to give it a sort of head start, is very important. Now, that transfer could be: the first neural network learned the most important features to pick out of an image; let's feed the second neural network those most important features and let it continue from there. Or it could be something where you're using one neural network to provide training data for another neural network, so that in a sense they're competing with each other, and a variety of other things like that which actually have different names; transfer learning is mainly the first thing I was talking about.
Well, there are problems about getting enough training data. How many times do you show the same example to a neural network? You know, it's probably a bit like it is for us humans: when we memorize things, it's often helpful to go back and think again about the exact same example you were trying to memorize before; that's how it is with neural networks too. And there are also questions like, well, you have an image of a cat that looks like this; maybe you can get the equivalent of another image of a cat just by doing some simple image processing on the first cat. And it turns out that seems to work; that notion of data augmentation works surprisingly well; even fairly simple transformations are about as good as new data in terms of providing more data. Well, okay, the other big learning methodology that one tends to use is unsupervised learning, where you don't have to explicitly give the input-output examples. For example, in the case of something like ChatGPT, there's a wonderful trick you can use. ChatGPT's mission is to continue a piece of text; how do you train it? Well, you just have a bunch of text and you say, okay, you know, ChatGPT network, here's the text up to this point; we'll mask the text after that point; can you predict what's coming?
Can you learn to predict what happens when you take off the mask? And for that task you don't have to explicitly give it input and output; you can get it implicitly, just from the original data you've been given. Basically, what happens when you're training the ChatGPT neural network is that you're saying: here's all this English text, which comes from billions of web pages; now look at the text up to this point and ask, can you correctly predict what text comes next? Okay, it got it wrong; well, that provides, you know, that means there's some loss associated with that; let's see if we can adjust the weights in the neural network to get it closer to correctly predicting what's coming next. So anyway, the end result of all this is that you create a neural net. Actually, let me show you what training looks like; it's very easy to train small neural networks. Let's see, let's do one. So here's a collection of handwritten digits, maybe 50, 100 digits. This is a supervised training story, where here are all the zeros, and each one is labeled: that's a zero, it says it's a zero; that's a nine, it says it's a nine. Okay, so let's take a random sample of, I don't know, two thousand of those, and now we're going to use that. Okay, there's our random sample of two thousand handwritten digits and what each was supposed to be. So let's take this LeNet neural network; this is now an untrained neural network, and now we can say, if we wanted to, we should be able to just train that neural network with this data: there's that data, there you go on line 32. Let's say, train this, and then what's going to happen is this shows us the loss as it's presented with more and more of those examples, and it's shown the same example many times; you'll see the loss decrease, and it gradually learns.
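For anyone who wants to try something analogous outside the Wolfram Language, here is a rough Python equivalent of that demo, using scikit-learn's small built-in digit images (8x8 pixels, 1797 of them in total, so the training sample here is 1500 rather than the 2000 in the demo):

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

digits = load_digits()                               # small handwritten digit images
X_train, y_train = digits.data[:1500], digits.target[:1500]

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)                            # the loss decreases as it trains

# Ask the trained network about digits it never saw during training
print(net.predict(digits.data[1500:1505]))
print(digits.target[1500:1505])                      # the actual labels
```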
Okay, now we have a trained neural network, and now we can go back to our original collection of digits. Let's close that; let's go back to our original collection of digits, pick a random digit here; actually, let's just pick another random sample here, let's pick five examples there. Oh, I shouldn't have asked it to do that; okay, here we go. So now we can take this trained neural network, here's our trained neural network, and let's feed it that particular nine.
Remember we only trained it on 2000 examples, so it didn't have much training, but, oops, I shouldn't have done that, I should have used that, okay: it successfully told us it was a nine. So this is what it looks like to train, you know, the Wolfram Language version of training a neural network; this was a super simple neural net with only two thousand examples, but that's what it looks like to do that training. Okay, so let's get to ChatGPT; we can go ahead and talk about its training, but before we move on to the ChatGPT training, I'm going to talk about one more thing we need, this question, well, let's see.
Do we really need to talk about this? Yeah, I should probably talk about this: the question of how things like words are represented with numbers. So let's say we have all these words; we could just number each word in English; we could say apple is 75, pear is 43, etcetera, etcetera. But there are more useful ways to label English words with numbers, and the most useful way is to get collections of numbers that have the property that words with close meanings have close collections of numbers. So it's like we're placing each word somewhere in some meaning space, and we're trying to set it up so that words have a position in meaning space with the property that if two words are close in meaning space, they should mean almost the same thing. So here, for example, is a collection of words arranged in one of these meaning spaces; the real meaning spaces, like the one ChatGPT uses, have, what is it, probably 12,000 dimensions or so; this one here is just two-dimensional. We're just putting things like dog and cat, alligator and crocodile, and then a bunch of fruits here, and the most important thing to notice is that things with similar meanings, like alligator and crocodile, end up close in this meaning space, and, you know, peach and apricot end up close in meaning space. So in other words, we're representing these words by collections of numbers, in this case just pairs of numbers, just coordinates, that have the property that those coordinates are a kind of representation of the meaning of these words. And we can do the same thing when it comes to images; that's exactly what we had when we looked at the digit images:
in this way, we're putting different handwritten digits into some kind of meaning space of handwritten digits, where in that meaning space the ones that mean one were here, the ones that mean three were here, and so on. So one question is: how do you generate coordinates that represent things, so-called embeddings of things, so that when things are close in meaning they have close coordinates? Well, there are a number of interesting tricks used to do this. A typical kind of setup is this: imagine we have, and this is just a representation of the neural network we used to recognize digits, these multiple layers; each one is just a little Wolfram Language representation of that.
What this network actually does, in the end, is take that collection of pixels at the beginning and compute probabilities. It's going to produce a collection of numbers at the end, because remember, neural networks, all they deal with is collections of numbers. So it produces a collection of ten numbers at the end, where each position in this collection is the probability that what was shown to the neural network corresponded to a zero, a one, a two, three or four. So what you see here is that the numbers are absurdly small except in the case of four, for example, from which we can immediately deduce: okay, the image was supposed to be a four. So this is the output of the neural network, this collection of probabilities, where in this particular case it was essentially certain that the thing was a four; that's what we deduce.
Now what we can do is say, well, let's back up a layer in the neural network. Before we get to that final output, there's a layer that tries to get the network to actually make a decision; I think it's a softmax layer, at the end, that's trying to force the decision: it's trying to exponentially separate these numbers so that the big number gets bigger and the small numbers get smaller. Okay, but one layer before, those numbers are a bit more sober in size; before it's collapsed things down to make a decision, those numbers are a lot more sober in size, and the numbers in that layer give a pretty decent indication of the fullness of what we're seeing: they have more information about what was actually being shown. We can think of those numbers as giving some kind of signature, some kind of trace, of what kind of thing we were looking at, specifying, in a sense, characteristics of what we were seeing that we will later decide is a four; all those other subsidiary numbers are already useful.
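The softmax step described here is simple enough to write down directly; this sketch shows how exponentiating pushes the biggest of the "sober-sized" numbers toward probability one and the rest toward zero:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))        # subtract the max for numerical stability
    return e / e.sum()

pre_decision = np.array([1.2, -0.3, 0.8, 5.1, 0.0])   # sober-sized layer values
print(softmax(pre_decision))
# -> roughly [0.02, 0.004, 0.013, 0.96, 0.006]: the decision has been "forced"
```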
So, going back, we can define these feature vectors: this is a feature vector that represents that image there, that's the feature vector that represents this image here, and we see that, yes, these feature vectors for different fours will be a little bit different, but they're dramatically different between a four and an eight. We can use these vectors to represent the important aspects of this four here, for example. And if we go back a couple more layers in that neural network, it turns out we can get an array of around 500 numbers that are a pretty good representation, a pretty good feature signature, of any of these images. If we do the same with images of dogs and cats, we can get this kind of signature, the feature vector associated with what's important about each image, and then we can take those feature vectors and lay things out according to the different values in those vectors, and we get this kind of embedding, in what we can consider a kind of meaning space. So, in the case of words: how do we do that for words?
Well, the idea is the same as for getting a feature vector associated with, say, images: there we had a task, like trying to recognize digits, and then we backtracked from the final answer. We train a neural network to perform a task, but what we end up doing is backing up from that final answer: we specify the task and ask what was there just before it managed to complete the task; that is our representation of the relevant features of the thing. Well, you can do the same thing with words. So, for example, if we say "the ___ cat" and then ask, in our training data, what that blank is likely to be, you know, is it black, is it white, or something else, we could try to create a network that predicts what that middle word is likely to be, what the probabilities of that middle word are: can we train a network to be good at predicting the probabilities of black versus white versus tabby or whatever? And once we have that, we can back up from the final answer and say, let's look at the innards of the network and see what it had computed on its way to the final result. We look just a little bit before it comes to the end result, and that's going to be a good representation of the features that were important about those words; that's how we can deduce these kinds of feature vectors for words. So in the case of GPT-2, for example, we can compute those feature vectors.
They're extremely uninformative when we look at them as full feature vectors. What's more informative is to project these feature vectors into a smaller number of dimensions; then we'll discover that cat is probably closer to dog than to chair. And that's what ChatGPT does when dealing with words: it always represents them using these feature vectors, using this kind of embedding that turns them into collections of numbers with the property that nearby words have similar representations. Actually, I'm getting a little ahead of myself, because the way ChatGPT works, it uses these kinds of embeddings but does it for entire chunks of text rather than individual words; we'll get there. So I think we're getting along pretty well here.
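Here's a sketch of that "nearby words have nearby vectors" property with tiny hand-written toy vectors (real embeddings are learned and have thousands of dimensions; these three-dimensional ones exist only to show the comparison):

```python
import numpy as np

toy_embedding = {                      # made-up vectors, for illustration only
    "cat":       np.array([0.9, 0.8, 0.1]),
    "dog":       np.array([0.8, 0.9, 0.2]),
    "alligator": np.array([0.7, 0.2, 0.9]),
    "chair":     np.array([-0.1, 0.0, 0.2]),
}

def cosine(a, b):                      # 1.0 = same direction, 0 = unrelated
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for word in ("dog", "alligator", "chair"):
    print("cat vs", word,
          round(cosine(toy_embedding["cat"], toy_embedding[word]), 2))
# cat comes out closest to dog, farther from alligator, farthest from chair
```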
How about the reality of ChatGPT? Well, ignoring the details, it has millions of neurons, 175 billion connections between them. And its basic architecture is the kind of big idea that actually came out of language translation networks, where the task was to start from English and end up with French or whatever else: this idea of what are called transformers. It's a neural network architecture; more complicated architectures were used before this actually simpler one. And the notion is, as I mentioned, that when one deals with images, it's convenient to have the neurons tied to the pixels, at least in a kind of arrangement of which pixel is next to which pixel; those are the so-called convolutional neural networks, or convnets, which are the typical things used there. In the case of language, what transformers do is deal with the fact that language comes in a sequence. With an image, one says: here's this pixel, what's happening in the nearby neighbor pixels of the image? In a transformer, what one is doing is saying: here's a word, let's look at the previous words, the words that came before this word, and in particular let's pay attention differently to different ones of those words. Now, this gets pretty elaborate in its engineering quite quickly, and, you know, it's very typical of a sophisticated engineering system that there's a lot of detail here, and I'm not going to go into much of it, but this part is, in a sense, the front end. Okay, so remember what ChatGPT is ultimately doing: it's a neural network whose goal is to continue a chunk of text. So it's essentially going to ingest the chunk of text so far, reading in each token of the text. Tokens are words or word fragments; things like the "ing" at the end of a word can be a separate token; they're convenient pieces of words; there are about 50,000 different possible tokens. So it's reading the text,
the prompt plus the text that's been generated so far; it's reading all those things, and then its goal is to continue that particular text: every time it has read all this text, the neural network will give you a new token, telling you what the next token should be, or what the probabilities of different choices for the next token should be. So one part of this is the embedding part, where what's happening is it's reading in tokens and, I mean, there's a lot of detail here, but, for example, let's say the sequence we were reading was hello hello hello hello hello bye bye bye bye; this shows the resulting embeddings that you get, this shows what it's trying to represent.
Earlier we were talking about embeddings for words; now we're talking about embeddings for entire chunks of text, and asking what the sequence of numbers is that should represent that piece of text. The way you set it up, I mean, again, this is getting pretty deep into the guts of the creature, but there are different components to this embedding vector, and let's see what I'm doing here: this image goes across the whole page, showing the contribution of each word, and down the page it shows the different parts of the feature vector being constructed. The way it works is that you take each word, and then the position of the word is encoded. You could encode the position of the word as binary digits, saying it's word number seven, you know, 0 1 1 1 or something, but that doesn't work as well as essentially learning this collection of random-looking things that are effectively placeholders for positions. Anyway, the final result is this thing that represents the text, where each level is a different kind of feature associated with each of these words, and that's what gets fed into the next level of the neural network.
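A minimal sketch of that embedding step, with toy sizes (the real GPT-3-scale numbers would be roughly a 50,000-token vocabulary and 12,288 dimensions); the learned values are replaced by random stand-ins, and the token ids are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_positions = 100, 16, 10   # toy sizes; GPT-3: ~50k, 12288

token_embedding = rng.normal(size=(vocab_size, d_model))      # per-token vectors
position_embedding = rng.normal(size=(n_positions, d_model))  # per-position vectors

tokens = [7, 7, 42]                  # hypothetical token ids ("hello hello bye")
embedded = np.array([token_embedding[t] + position_embedding[i]
                     for i, t in enumerate(tokens)])
print(embedded.shape)                # (3, 16): one combined vector per token
```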
Okay, so the next big piece is the so-called attention block, which I don't know how much is worth explaining; I talk about this a little more in what I wrote. But essentially what's happening is that, in the end, it's just a big neural network, except that neural network doesn't have all the possible connections. It has, for example, only connections that look back, that look at places earlier in the text, and it in a sense pays attention differently to different parts of that text. You can make a picture here of the amount of attention it's paying, and by attention I mean literally the size of the weights with which it's weighting different parts of the sequence that came in. And the way it works, I think, for GPT-3 is: first of all, it has this embedding vector, which for GPT-3 is 12,288 long. I don't know why it's so particular; oh yes, I do know why: that number is a multiple of the other sizes in the network. And it's trying to put together an embedding vector to represent the text so far, where you've had contributions from words at different positions, and it has, so to speak, determined how much contribution it should get from the words at each different position. Well, okay, so you do that, and then you feed everything into a layer of the neural network which has, what is it,
a 12,000-by-12,000 matrix of weights, specifying, for each incoming neuron, the weight it has for each outgoing neuron. And the result is that you get this whole set of weights that don't look like anything in particular; these are weights that ChatGPT has learned so that they're useful for its task of continuing text. And you know, you can play little games: you can try to visualize those weights by taking moving averages, and you can see that the weights look more or less randomly chosen, but this shows you a little of the detail within that randomness. In a sense, you can think of this as a kind of brain scan of ChatGPT: it shows you, at the level of these individual weights in this neural network, what its representation of human language is. It's the level you're at,
you know, like when you take apart a computer and look at individual bits inside the CPU; this is pretty much the same for ChatGPT's representation of language. And it turns out there isn't just one of these attention layers. What happens is that different elements of the feature vector for the text get separated into different blocks that are handled differently. No one really knows what the interpretation of those blocks is; it's just been found that it's good not to treat the entire feature vector the same, but to break it into blocks and treat the blocks of pieces of that feature vector differently. Maybe there's an interpretation of one part of that feature vector, that, I don't know, it's the words about movement or something; but it won't be anything like that, it won't be anything as understandable to humans as that. It's like a human genome or something: all the traits are mixed together in the specification; it's not something where we can easily have a narrative description of what's going on. But what's been found is that you split up this feature vector of text features, and you have these separate attention heads that each run this kind of reweighting process, and you do that, let's see, 96 times for ChatGPT.
We're doing the same process 96 times, and this is where, you know, it's crazy that things like this work. This, for GPT-2, the simplest version, is a kind of representation of the things that come out of these attention layers, attention blocks, what kinds of weights were used. And you know, there may seem to be some regularity; I don't know what it means, but if you look at the distribution of the sizes of the weights, for some layers they're Gaussian distributed and for some layers they're not. I have no idea what the meaning of that is; it's just a feature of what ChatGPT learned while trying to understand human language from the web.
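For concreteness, here's one attention head sketched in numpy under the usual scaled-dot-product formulation (the weights are random stand-ins for learned values; GPT-3 runs 96 such heads through 96 layers): each position builds query, key and value vectors, compares its query against the keys of earlier positions only, and takes a probability-weighted mix of their values:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))          # embeddings of the tokens so far

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # stand-in weights
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)              # how strongly each position attends
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                     # only look back, never forward

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax over earlier positions
print((weights @ V).shape)                 # (5, 8): the re-mixed value vectors
```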
Okay, so again, as I've said, in the end what's happening is that it's just one big neural network, and it's being trained: we're trying to deduce the weights of the neural network by showing it a whole bunch of text and asking, what weights do you have to have in the neural network so that the continuation of the text has the right probabilities for what word comes next? That's the goal, and I've sort of described the outline of how that's done. In the end, one has to feed it data.
The reason it's possible to do this is that there's a huge amount of training data to feed it: it's been given a significant fraction of what's on the web. There are maybe, it depends on how you count, maybe six billion, maybe 10 billion reasonably human-written pages on the web, where humans actually wrote those things, they weren't mostly machine-generated, etcetera, etcetera, that are publicly visible without logins and without making lots of different queries to see what you get; that's the raw, on-the-page content of the web. There's maybe 10, maybe 100 times more than that if you can make queries that drill down into internal web pages, things like that, but you have something like, you know, some billions of pages written by humans. And there's a convenient collection called Common Crawl, where you start from a web page, you follow all the links, you collect all those pages, you keep following links until you've visited all the connected parts of the web.
The result of this is that there are maybe a trillion words of text you can readily get from the web. There are also probably 100 million books that have been published; I think the best estimate is maybe 130 million books ever published, of which five or ten million exist in digitized form, and you can use those as training data too; that's another 100 billion or so words of text. So you have hundreds of billions of words of text, and there's probably a lot more than that if you have video transcripts and things like this. You know, for me personally,
I've made a kind of personal estimate of these things. I worked out that the things I've written throughout my life amount to about three million words; the emails I've sent over the last 30 years are another 15 million words; and the total number of words I've typed is around 50 million. Interestingly, in the livestreams I've done over the last few years I've spoken another 10 million words. So that gives a sense of what one human's output is. But the main point is that there are, say, a trillion words available that you can use to train a neural network to perform this task of continuing text. So, about the actual process of training: one thing to understand is this question; when we looked at those functions before, we asked how many neurons we need to represent the function well, and how many training examples we have to give to train the neural network to represent that function.
In those cases, we didn't need very large neural networks, but we needed a lot of training examples. There have been all kinds of efforts to understand how many training examples you really need, and how big a neural network you really need, to do something like this text continuation. Well, it's not really known, but, you know, with 175 billion weights, the surprise is that ChatGPT does it quite well. Now you can ask: how much training does it need? How many times do you have to show it those billions of words? What's the relationship between the billions of words and the number of weights in the network? It seems to be the case that for text, the number of weights in the network is comparable to the number of training examples; it sees each training example roughly once, and if you show it the same examples too many times, it actually makes its performance worse; that's very different from what happens when you train on mathematical functions and things like this. Now I should explain, by the way, what happens every time the neural network runs. In the case of ChatGPT, you're giving it this collection of numbers that represents the text so far; that collection of numbers is the input to the neural network, and it ripples through the neural network, layer after layer after layer; it has about 400 core layers; it propagates through all those layers, and at the end you get an array of numbers, which are the probabilities for each of the 50,000 possible tokens, and based on that it chooses the next token. So the main operation of ChatGPT is very simple: you have this text so far, it flows through this network once, and out comes what the next token should be. That's actually very different from the way computers tend to work for other purposes. In most non-trivial computations, you take the same piece of computational machinery, the same piece of data, and compute with it over and over again; in simple models of computation, like Turing machines, that's what happens all the time; that's what makes computers able to do the non-trivial things computers do: they take maybe a small amount of data and just reprocess it over and over again.
In something like ChatGPT, you have this big network and you just flow through it once for each token. The only sense in which there's feedback is that once you get an output, you add that token to the input you give it in the next step; it's a sort of outer loop where you're giving feedback by adding tokens to the text; then that flows through and you get another token, which flows through again, so it's a very big outer loop. Probably, in a lot of the non-trivial calculations computers do, there are lots of inner loops happening; quite possibly in the brain there are inner loops happening too. But the model we have in ChatGPT is this flow-through-once model: a very complicated network that only flows through once. That's how it works. But one of the things that's costly is that every time it flows through, it has to use every one of those weights; so for each token ChatGPT produces, it's essentially doing 175 billion mathematical operations, using every one of those weights to compute the result.
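That outer loop is simple enough to sketch. Here `next_token_probabilities` is a stand-in for the single pass through all the weights (made-up probabilities, not a real model), and the loop just keeps appending the chosen token to the text:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probabilities(tokens):
    # Stand-in for one forward pass through the whole network:
    # in reality this is where the ~175 billion weights get used, once
    scores = rng.normal(size=len(vocab))
    e = np.exp(scores - scores.max())
    return e / e.sum()

tokens = ["the"]                              # the prompt
for _ in range(5):                            # one full network run per new token
    p = next_token_probabilities(tokens)
    tokens.append(vocab[int(np.argmax(p))])   # zero-temperature choice
print(" ".join(tokens))
```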
Chances are that's not actually necessary, but we don't know how to do better than that right now; that's what it's doing every time it flows through. And when you train ChatGPT, you know, having the weights change based on the loss, every time you take a training step you have to do a kind of reverse version of that forward inference process, back-propagation. It turns out the reverse process is not much more expensive than the forward process, but you have to do it many times during training. So typically, if you have a model of size roughly n for text, it looks like you need about n-squared computational effort to do the training, and n is quite large when you're dealing with a ChatGPT-sized language model. So that little mathematical square in the training process is really important, and it means, you know, potentially spending hundreds of millions of dollars doing the training with current GPUs and things like this; that's what you have to think about based on the current model of how neural networks work.
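As a rough back-of-the-envelope (this is a commonly quoted scaling rule, not a figure from the talk): transformer training cost is often estimated at about six floating-point operations per weight per training token, so with the number of training tokens comparable to the number of weights, as described above, the cost grows like the square of the model size:

$$\text{training FLOPs} \;\approx\; 6\,N\,D \;\sim\; 6\,N^{2} \quad \text{when } D \sim N,$$

where $N$ is the number of weights and $D$ the number of training tokens.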
Now, I have to say that there are many aspects of the current model that are probably not the final model, and we can clearly see that there are big differences, for example, between this and the things the brain manages to do. One big difference: most of the time, when you're training a neural network, you're doing it by having a bunch of things in memory and some computation going on, but the things in memory are mostly inactive most of the time, and there's only a little bit of computation taking place. In the brain,
each of our neurons is both a place that stores memory and a place that computes. It's a different kind of setup, and we don't know how to do neural network training that way. There are various things that have been discussed since the distant past; in fact, since the 1940s people have been thinking about distributed ways of doing learning in neural networks, but that's not something we can do yet. Okay, in the case of ChatGPT, one important thing was this: six months ago, a year ago, there were early versions of the GPT family, text-completion systems and so on, and the text they produced was so-so. Then OpenAI did something with GPT, an additional step, a reinforcement-learning training step, where essentially what was done was that humans told ChatGPT: go and write an essay, go and be a chatbot, you know, have a conversation with me; and the humans rated what came out and said, that's terrible, that's better, that's terrible, etcetera, etcetera.
And it turned out that that little push had a very big effect: that little bit of human guidance of, yes, you got all this from web statistics, but now, looking at what you produced, this direction you're going in is a bad direction and it's going to lead to a really boring essay, or whatever. And by the way, this is a place where there are a lot of complications about, well, what do humans really believe the system should be producing? If humans say, we really don't want you to talk about this,
we really don't want you to talk about that, this is the place where that gets injected: in this reinforcement-learning step at the end. What you do is, for example, given the way humans edited and rated those essays, you can look at what they did when they touched up those essays and rated what came out, and you can try to automatically learn that set of things the humans did. Then you can use that to provide a lot more training data, to do a retraining of the network based on the adjustments the humans made: the fine-tuning of this network, based on the particular touch-ups the humans did, becomes another network that can then be used to produce the examples for training the mainline one. So that's something that seems to have had a big effect on the actual human perception of what's going on with ChatGPT. And I think the other surprise is that you can give it these long prompts where you tell it all kinds of things, and then it will use that in a fairly human way to generate the text that comes after.
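One common way to turn such human ratings into something trainable is a preference loss of the Bradley-Terry kind ("this output was better than that one"). This is a hedged sketch of that general idea, not OpenAI's actual setup, with made-up features standing in for whatever a real reward model sees:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = np.zeros(d)                    # toy reward model: score = w . features

# Each pair: (features of the output humans preferred, features of the other)
pairs = [(rng.normal(size=d) + 1.0, rng.normal(size=d)) for _ in range(200)]

for _ in range(100):               # gradient ascent on the preference likelihood
    grad = np.zeros(d)
    for good, bad in pairs:
        p_good = 1 / (1 + np.exp(-(w @ good - w @ bad)))  # P(good beats bad)
        grad += (1 - p_good) * (good - bad)
    w += 0.05 * grad / len(pairs)

good, bad = pairs[0]
print(w @ good > w @ bad)          # preferred outputs now score higher
```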
Well, the big question is: how is this possible? Why does it work? Why is it that a thing with only, you know, a hundred billion weights or so can reproduce this kind of amazing thing that seems to require all the depth of human thought and the brain and things like human language? How does it manage to work? And I think the key to realizing that is that what it's really telling us is a scientific fact: it's telling us that there's more regularity in human language than we thought.
Human language has a lot of structure, and what ChatGPT does is learn a lot of that structure, structure we didn't even notice was there; that's what allows it to generate these plausible pieces of text: it's making use of structure. Now, we know certain kinds of structure that exist in language. For example, here's a piece of structure we know; let me share this again. A piece of structure we know is syntactic grammar. We know that sentences are not random jumbles of words; sentences are made up of nouns in particular places, verbs in particular places, and we can represent that by a parse tree, in which, let's say, here's the complete sentence: there's a noun phrase, a verb phrase, another noun phrase, and these break down in certain ways. This is the parse tree, and for this to be a grammatically correct sentence, there are only certain possible forms of parse tree that correspond to grammatical correctness. So this is a regularity of language that we've known about for a couple of thousand years; it was really only codified, and it was a big effort to codify it, in 1956. But the general idea has been known for a long time: we can represent the syntactic grammar of language by these kinds of rules that say you can put nouns together with verbs only in this way and that. And it's been a great source of controversy in linguistics that for any set of rules you define, there will always be some strange exception where people normally say this instead of that; but, you know, it's very similar to what happens in typical machine learning.
If you're interested in getting it 95 percent right, then there are rigid rules, plus some exceptions here and there. Well, that's a form of regularity we know exists in language, a syntactic regularity. Now we can ask: has ChatGPT effectively learned this syntactic grammar implicitly? Nobody ever told it that verbs and nouns go this way and that; it learned it implicitly, by virtue of seeing a billion words of text on the web that have these properties, and when you ask what the typical next words are, well, they'll be the words that followed in the examples it saw, and those will still be mostly grammatically correct. Now, we can take a simpler version of this, where we can understand what's going on: we can take a very, very trivial grammar, a grammar that is just parentheses, just open and close parentheses, where something is grammatically correct if the parentheses that are opened always eventually close. And this is a parse tree for a parenthesis sequence, you know, open open open close open close, etcetera; the parse tree shows a representation of parsing this sequence of open and close parentheses. Okay, so we can ask: can we train a neural network to know even this particular kind of syntactic grammar, and what would it take?
We looked at how big a network it took, and it was quite small: we made it a transformer network with eight heads and a length of 128. So ours was much simpler than ChatGPT, but you can use one of these transformers, and if you look at the post I made, the actual transformer is there and you can play with it in Wolfram Language. Anyway, if you give that transformer this sequence here and ask what comes next, it says, okay, a 0.54 probability that there's a close parenthesis there, based on its training data, which was a randomly selected collection of correct sequences of open and close parentheses. It has a little bit of error here, because it says with 0.0838 probability this is the end of the sequence, which of course would be grammatically incorrect, because there are parentheses still open here with no close. If we give it something that does close correctly, then it says, okay, great, a 0.34 probability this is the end of the sequence; there were no more opens here. There's a bit of error here too, because it says a 0.15 probability that there's a close parenthesis here, which can't be correct, because a close parenthesis here would have no corresponding open; it's not grammatically correct. But anyway, this gives an idea of what it takes for one of these transformer nets.
We can look inside this transformer net; we can see what it took to learn this very simple grammar. ChatGPT is learning the much more complicated grammar of English. In a way, it's probably easier to learn English grammar, because there are so many clues in the actual words used about how they combine grammatically, and so many mistakes that we humans wouldn't even notice, because they're pretty much what we'd do ourselves. But in this more austere case of this mathematically defined parenthesis language, we do notice. So if we give it a bunch of open parentheses and so on and ask for the most likely continuation,
you'll find that it does pretty well up to a point, and then it starts to lose it, and it's a little like what would happen with humans. You know, at some point here we can tell, just by looking, that the parentheses are closed correctly; it gets harder to tell when we get out here, and it gets harder for the network to tell as well. This is a typical characteristic of these neural nets: with these kinds of shallow questions, oh, you just see this block of things, you see another block of things, it does well; when it has to go much deeper, it doesn't work as well. For a normal computer that can do loops and things internally, it's very easy to figure out what's going on here, because it effectively just counts up the number of open marks and counts down the number of close marks, as in the sketch below.
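Here's that "normal computer" approach written out: an explicit inner loop with a counter, which handles any depth of nesting, exactly the kind of unbounded looping that a single feed-forward pass through a network doesn't get to do:

```python
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1               # count up on every open parenthesis
        elif ch == ")":
            depth -= 1               # count down on every close parenthesis
            if depth < 0:            # a ")" arrived with nothing to match
                return False
    return depth == 0                # every "(" must eventually be closed

print(balanced("(()(()))"))          # True
print(balanced("(()"))               # False
```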
By the way, if you try this in the real ChatGPT, it will also confidently claim that the parentheses match, but will often be incorrect; longer parenthesis sequences have exactly the same problem. It fails at a slightly larger size, but it will still fail; that's just a characteristic of this kind of thing. So, well, okay: one type of regularity in language that ChatGPT has learned is syntactic grammar. Another type of regularity, one more that you can easily identify, is logic. And what is logic? Well, originally, logic was invented by Aristotle, as far as we know, and what Aristotle did was effectively a bit like a machine learning system.
He looked at lots of examples of rhetoric, lots of example speeches that people gave, and he asked: what are some forms of argument that appear repeatedly? If someone says something like: all men are mortal, Socrates is a man, therefore Socrates is mortal; all X's are Y, Z is an X, therefore Z is a Y; those are meaningful patterns. Originally, in syllogistic logic, which is what Aristotle invented, it was really based very much on language, and people memorized it; you know, people in the Middle Ages memorized these forms of syllogism, the Barbara syllogism, the Celarent syllogism, and so on, which were just patterns of word usage where you could substitute a different word for Socrates, but it was still the same pattern, that same structure. So that was another form of regularity, and when ChatGPT seems to be figuring things out, part of what it's figuring out is, you know, syllogistic logic, because it's seen millions of examples, just like Aristotle, who presumably saw a lot of examples when he invented logic.
He had seen a lot of examples of "this sentence follows this sentence" working this way, and then he could do it too, just as ChatGPT can say what the statistically expected thing is based on the web. By the way, when logic was developed further in the 19th century, when people like Boole came on the scene and did formal logic, it wasn't just these patterns anymore: you could build many layers of structure, you could build very complicated logical expressions where everything was deeply nested, and of course our computers today are based on those deeply nested logical expressions. ChatGPT doesn't have any chance of decoding what's going on with one of those deeply nested, mathematical, computational-style Boolean expressions, but it works fine at this kind of Aristotle level, this kind of template-based logical structure. Well, I wanted to talk just for a moment, and then we should finish up here and I can try to answer some questions, about what the regularities are that ChatGPT has discovered in this thing we do called language, and all the thinking that revolves around language. And I don't know the answer to this.
I have some ideas about what's going on; I'll just give you a little tour. We talked about this kind of meaning space, the space in which words can be organized, and we can see how words get laid out; these are different parts of speech. For a given word, there can be different places in the meaning space where different instances of that word occur. This is the word "crane", and these are different sentences; there are two obvious meanings of crane, you know, the bird and the machine, and they split according to where they sit in meaning space, so we can see the kind of structure the meaning space has. Another thing we can ask is whether the meaning space is like physical space: is it true that there are parallel lines in meaning space? Are there things where we can go from place A to place B and then parallel-transport ourselves to new places? So we can ask whether we have analogies.
Is it true that we can go from woman to man and from queen to king along parallel paths in meaning space? The answer is, well, maybe, a little. That's really the question. In physical space, this is the question of whether it's like a flat space: if we have things moving in a flat space, you know, Newton's first law says that if no force acts on a thing, it will keep going in a straight line; then we have gravity, and we can represent gravity by talking about the curvature of space.
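The classic form of that question is the vector arithmetic king - man + woman ≈ queen. With real learned embeddings the answer is "roughly, sometimes"; this sketch with hand-written toy vectors just shows the arithmetic being asked about:

```python
import numpy as np

toy = {                               # made-up vectors, chosen to be parallel
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
}

candidate = toy["king"] - toy["man"] + toy["woman"]
best = min(toy, key=lambda word: np.linalg.norm(toy[word] - candidate))
print(best)   # "queen": in this toy space the analogy direction is exactly parallel
```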
Here, the question is: when we go from one word to another, we're moving in a certain direction in meaning space, and in a certain sense, asking whether these analogies correspond, whether we can do this kind of parallel-transport idea, is something like asking how flat meaning space is, how much effective gravity there is in meaning space, or something like that. I mean, meaning space is probably not represented in terms of the kinds of things physical space is represented in terms of, but it's a question. So now, when it comes to the operation of ChatGPT, we can think about how it moves in meaning space. It has its prompt, you know, "the best thing about AI is its ability to", and that prompt effectively puts it at a place in meaning space; then what it does is keep moving through meaning space. So the question is whether there's something like a semantic law of motion, analogous to the laws of motion we have in physical space, but in the meaning space of concepts and words: something where we can say, okay, if it's moved this way, it's like it has momentum in this direction, and in meaning space it's going to continue in that direction. Meaning space isn't at all that simple, but the question is how we think about, how we represent, the process of moving through meaning space.
Well, we can start to look, for example, at the different possible continuations we get from "the best thing about AI is its ability to": what's the next word? We can look at this kind of fan of different directions it could go in meaning space at that point, and we can see that there is some direction in meaning space it tends to go, and regions it doesn't reach, at least not with high probability. If we continue, we can see how that fan develops as we go on with the sentence.
That's our question about motion in meaning space, and I still don't know exactly what this means, but this is what the trajectory in meaning space looks like when ChatGPT tries to continue a sentence. The green path is what it actually produced, I think this is a zero-temperature case, and the gray things are the lower-probability alternatives. So that's the kind of thing we can look at if we want to do natural science on ChatGPT and ask what it has discovered about how language is constructed.
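As a minimal sketch of that fan of continuations, with made-up words and probabilities (my illustration, not ChatGPT's actual numbers): at zero temperature the trajectory always follows the single most probable branch, the green path, while at higher temperature the gray, lower-probability branches can be taken.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-word logits after "the best thing about AI is its ability to"
words = ["learn", "predict", "make", "understand", "do"]
logits = np.array([2.0, 1.7, 1.3, 1.1, 0.9])

def next_word(logits, temperature):
    if temperature == 0:               # zero temperature: always the top word
        return words[int(np.argmax(logits))]
    p = np.exp(logits / temperature)
    p /= p.sum()                       # softmax at the given temperature
    return rng.choice(words, p=p)      # sample: lower-probability branches can occur

print(next_word(logits, 0))    # 'learn' every time
print(next_word(logits, 0.8))  # varies from run to run
```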
One possibility is that there are these kinds of semantic laws of motion that describe how you move through the space of meanings as you add words to a fragment of text. I think a slightly different way of thinking about this is in terms of what one might call a semantic grammar. Syntactic grammar is just about nouns, verbs, parts of speech, things like that; but we can ask whether there is a generalization of that which is more semantic, which has finer gradations and doesn't just say "this is a noun, this is a verb," but says "oh, that verb means motion, and when we put this noun together with this motion word, that's something that can move." We'd have buckets of meaning that are finer gradations than just parts of speech, but not necessarily individual words.
There would be a kind of semantic grammar we could identify, a kind of construction kit for how we put together not only sentences that are syntactically, grammatically correct, but sentences that are somehow semantically correct. I firmly believe this is possible, and it's kind of what Aristotle was looking for: he even talks about semantic categories and things like this. He does it in a way that reflects the fact that it was two thousand years ago, and we didn't know about computers and many kinds of formal things we know now.
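Here's a toy sketch of what such a semantic grammar might look like, with buckets of meaning I made up for illustration: a "motion" construction that only accepts subjects from a "movable things" bucket, so it assembles sentences that are semantically coherent, even when, like the elephant example below, they are untrue.

```python
import itertools

movable_things = ["the elephant", "the bird", "the ball"]
abstract_things = ["the idea", "the theorem"]
motion_verbs = ["flew", "rolled"]
destinations = ["to the moon", "down the hill"]

def motion_sentence(subject: str, verb: str, destination: str) -> str | None:
    # Semantic rule: only things in the "movable" bucket can take a motion verb.
    if subject in movable_things:
        return f"{subject} {verb} {destination}"
    return None  # syntactically fine, semantically rejected

for s, v, d in itertools.product(movable_things + abstract_things,
                                 motion_verbs, destinations):
    sentence = motion_sentence(s, v, d)
    if sentence:
        print(sentence)  # e.g. "the elephant flew to the moon" -- meaningful, if untrue
```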
It's quite strange how little work has been done over the last 2,000 years trying to create some sort of semantic grammar. There was a small effort in the 17th century, with people like Leibniz and his characteristica universalis, and various other people who tried to make what they called philosophical languages, word-independent ways of describing meaning; and there have been more recent efforts, but they have tended to be quite specific, quite based on linguistics and on the details of the structure of human language. What is being discovered with this idea of a semantic grammar is that there are rules beyond syntax: rules about how to put together a meaningful sentence. Now, a meaningful sentence could be something like "the elephant flew to the moon." That sentence certainly means something; we can perfectly well conjure up an image of what it means. Has it happened in the world? No, not as far as we know; but it could be in a story, it could be in a fictional world.
So this kind of semantic grammar will let you put together things that are meaningful descriptions of the world; whether they have actually been realized in the world is a separate question. Anyway, what's interesting to me about this is that it's something I've thought about for a long time, because I've spent a lot of my life building a computational language: a system that is an effort to represent the world computationally, so to speak, to take the things we know about chemicals or images or anything else, have a computational representation for all those things, and have a computational language that knows how all those things work, that knows how to compute the distance between two cities, all those kinds of things. I've spent the last four decades trying to find a way to represent things in the world in this computational way, so that you can then compute things about them in an explicit computational way.
We've been very successful at doing that; in a sense, the history of modern science is a history of being able to formalize many kinds of things in the world, and we take advantage of that in our computational language to formalize things in the world and compute how they will work. One feature of computing how things work is that inevitably some of those computations are deep computations, computations that something like ChatGPT can't do. There's a difference between the kind of shallow computations you can learn from examples in something like ChatGPT, where you can say "this piece of language I saw on the web can statistically fit in this place," and taking the world and representing it in some truly formal computational way so you can compute how the world works; just fitting those puzzle pieces of language together is a very different thing. Long before people thought about this idea of formalization, maybe 400 or more years ago,
all anyone could do was think about things in terms of language, in terms of words, in terms of immediate human thought. What emerged, first with mathematical science and then with computing, was this idea of formalizing things and getting much deeper ways of deducing what happens. What I discovered 30 or 40 years ago was this phenomenon of computational irreducibility: the idea that there really are things in the world where, to work out what is going to happen, you have no choice but to go through all the computational steps; you can't just skip to the end and say "I know what's going to happen" in some shallow way. So when we look at something like ChatGPT,
there are certain kinds of things it can do by matching these pieces of language, and other kinds of things it won't be able to do: it won't be able to do the kind of mathematical computation that requires a real computational representation of the world. For things like that, just as for us humans, it's a use-tools situation. Very conveniently, our Wolfram Alpha system, which is used in a lot of smart assistants and so on, has our computational language, Wolfram Language, underneath, but it takes natural language input. So it's able to take the natural language produced by something like ChatGPT, turn it into computational language,
do the computational work, get the right answer from the results, and hand that back to ChatGPT, which can then make sense of it, so to speak, instead of just following the word statistics of the web. So you can get the best of both worlds, something with this kind of fluency of language as well as the kind of depth of computation, by having ChatGPT use Wolfram Alpha as a tool, roughly along the lines of the sketch below. I wrote a bunch of stuff about that, and there are all kinds of things going on with it.
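A minimal sketch of that loop, where llm_generate() and wolfram_alpha_query() are hypothetical stand-ins I've invented for illustration (not real APIs), and the "TOOL:" convention is likewise made up:

```python
def llm_generate(prompt: str) -> str:
    """Stand-in for a language-model call; imagine it sometimes emits a tool request."""
    if "Computed result:" in prompt:
        return "Madrid is roughly six times as populous as Lisbon."
    return "TOOL: population of Madrid / population of Lisbon"

def wolfram_alpha_query(query: str) -> str:
    """Stand-in for a computational-knowledge query; the value is made up."""
    return "about 6.1"

def answer(question: str) -> str:
    draft = llm_generate(question)
    if draft.startswith("TOOL:"):
        # Depth of computation: delegate the part that needs real calculation...
        result = wolfram_alpha_query(draft[len("TOOL:"):].strip())
        # ...then have the language model wrap the result in fluent prose.
        return llm_generate(f"{question}\nComputed result: {result}\nAnswer in prose:")
    return draft

print(answer("How many times larger is Madrid than Lisbon?"))
```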
But talking about what ChatGPT has discovered: I think what it has discovered is that there is a semantic grammar for many things, that there is a way to represent, using a kind of computational primitives, many of the things we talk about in text. In our computational language we have representations of many kinds of things, whether foods, chemicals, stars or anything else. But when it comes to something like "I'm going to eat a piece of chocolate," we have a great representation of the piece of chocolate, we know all its nutritional properties, we know everything about it; we still don't have a good representation of the "I'm going to eat" part. I think what ChatGPT has shown us is that it's very plausible to get a kind of semantic grammar for these kinds of masses of meaning in language, and I think it's going to happen; I've been interested in doing this for a long time.
I think this is finally the impetus to really roll up our sleeves and do it. It's a complicated project for a variety of reasons, not least that you have to go through a process of language design, which is something I've been doing for 40 years in designing our computational language. This is a language design problem, and in my opinion language design is actually the most intellectually concentrated, difficult thing I know. This is a kind of generalization of that, but ChatGPT has shown us something:
I didn't know how difficult it was going to be, but now I'm convinced that it's possible. So what does this all mean? People might say, "OK, look, we've seen neural networks that convert speech to text, we've seen neural networks do image identification, now we've seen neural networks that can write essays; surely if we have a large enough neural network it can do everything." Well, no: neural networks of the kind we have so far, with the training structure they have so far, will not, on their own, be able to do these irreducible computations.
Now, these irreducible computations are not easy for us humans either: when it comes to doing mathematical calculations, or worse, if someone says "here is a program, run this program in your head," good luck, very few people can do that. There is a difference between what is immediate and easy for us humans and what is computationally possible. Another question is whether maybe we don't care about things that aren't easy for humans. It turns out we've built a lot of good technology over the last few centuries based on computation that goes to a much deeper level than what we can do in our heads.
Actually, in our technology we don't even go that far into irreducible computation, but we go far enough that it's beyond what humans can easily do, and beyond what we can do with the kinds of neural networks that exist today. So you have to understand that there is a certain set of things ChatGPT can do. What is happening in ChatGPT is like taking the average of the web plus the books and so on and saying "I'm going to fit things together based on that," and that's how it writes its essays. When it deduces things, when it does logic and things like that, it does logic the way Aristotle discovered logic: it's figuring out "oh, there's a pattern of words that looks like this, and it tends to be followed like that, because that's what I've seen in a hundred thousand examples on the web." That's what it's doing, and it gives us an idea of what it's going to be able to do. I think the most important thing it gives us is a form of user interface.
I might have something where what really matters is three bullet points, but if I'm going to communicate that to someone else, they're not really going to understand my three points on their own; they need a full essay describing them. That's the human interface, so to speak. You could have delivered raw bits or something, and that wouldn't be useful for us humans; we have to wrap it in a human-compatible way, and language is our richest human-compatible medium. So what is ChatGPT doing?
I think the way to think about it is that it's providing this interface. On its own it's just generating pieces of language that are coherent, but if you give it specific things to talk about, so to speak, then it wraps the details in this interface of fluent human language. OK, I went on much longer than I intended, and I see there are a lot of questions here, so I'm going to try to address some of them. One question asks: are constructed languages like Esperanto more susceptible to semantic grammar?
Very interesting question. The one I was experimenting with was one of the smallest of the constructed languages, a language called Toki Pona, which has only about 130 words. It's not a language that lets one express everything one could want to express, but it's a small language for small talk, so to speak, and it expresses a decent range of ideas. So yes, it's a good clue for semantic grammar that these little constructed languages exist. I also think probably the most elaborate of the constructed languages, Ithkuil, is another interesting source: it's a language that tries to incorporate, to a first approximation, the kinds of linguistic structures of all known languages.
So I think the answer is yes: I think they are a good stimulus for thinking about semantic grammar in a certain sense. When people were trying to do this back in the 17th century they were very confused about a lot of things; we've come a long way since then. They were confused about things like whether the actual letters used in writing the language mattered more than the structure of things, but there was the beginning of that kind of idea. OK, I'm going to take a few more of these, and I want to go back to some of the earlier ones. Tori is asking how we're going to study ChatGPT, what the best way to start is, and whether something like semantic laws of motion could be useful. Certainly, yes; but I don't know the answer.
I think that's a good question and I don't really know. Albert is asking whether the 4,000-token limit is analogous to working memory, and whether accessing larger memory means increasing token limits or increasing those capacities through reinforcement learning. With the token limits that exist right now: if you want a coherent essay, and you want the system to know what it was talking about in the first part of the essay, you'd better have enough tokens fed into the neural network every time it generates a new token; otherwise,
if it has forgotten what it was talking about 5,000 tokens ago, it may say totally silly things now, because it doesn't know what was there before. In a sense I don't think it's quite our short-term working memory, but it's like rambling: I ramble a lot, talking about things, and half an hour later I may have forgotten that I already talked about something and might be telling the same story again. I hope I don't do that; I don't think I do it that badly. But that's the kind of thing that happens with this token limit.
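Here's a tiny sketch of why that limit matters (my illustration; a real context window is thousands of tokens, and tokens are word fragments rather than whole words):

```python
# At each step, only the most recent tokens are fed back into the network,
# so older context simply falls out of view.
CONTEXT_LIMIT = 8  # tiny stand-in for a real limit like 4000 tokens

generated = ("the essay opens by defining cranes then compares the bird and "
             "the machine and finally returns to the bird").split()

# The only thing the network sees when choosing the next word:
window = generated[-CONTEXT_LIMIT:]
print(window)
# ['the', 'machine', 'and', 'finally', 'returns', 'to', 'the', 'bird']
# The opening definition has fallen out of the window, so the model can no
# longer tell which sense of "crane" the essay committed to earlier.
```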
Let me go back to some of the questions that were asked earlier. Aaron was asking me to talk more about the tension between superintelligence and computational irreducibility: how far machine intelligence can go. I think I talked a little bit about that. This question, oh boy, is a little complicated. The universe, the world, is full of computational irreducibility; that is, full of situations where we know the underlying rules, but to work out their consequences we have to run them as a computation, and it is not possible to shortcut the steps. What we have discovered in our Physics Project is that the lowest level of spacetime seems to work that way; in fact, today I saw a beautiful piece of work about doing a practical simulation of spacetime that uses those ideas and supports this again. It really is computationally irreducible at the lowest level, just as in something like a gas, where the molecules bounce around in this computationally irreducible way.
What we humans do is sample aspects of the universe that have enough reducibility that we can predict enough to go about our lives, as if we weren't paying attention to all those individual gas molecules bouncing around; we only pay attention to the aggregate gas pressure, or whatever. We don't pay attention to all the atoms of space; we only pay attention to the fact that there is something we can think of as more or less continuous space. Our history has been a history of finding pockets of reducibility. There are many things about the universe we cannot predict, and if our existence depended on those things, if we had not found these pockets of reducibility, we would not be able to have a coherent existence of the kind we have. If you ask where I'm going with that: there is an infinite collection, an infinite kind of network, of pockets of computational reducibility, an infinite set of things to discover, of which we have discovered some; as we advance our science and our technology we can explore more of that network of reducibility. But here's the problem: we humans have ways of describing only what we can describe. We have words for things that are common in our world, a word for a camera, a word for a chair, that kind of thing; we don't have words for things that haven't yet been common in our world. When we look in the bowels of ChatGPT, all kinds of things are happening; maybe some of those things happen quite often, but we don't have words for them, we haven't yet found a way to describe them. When we look at the natural world, the things we have seen repeatedly we have words to describe; we have built this descriptive layer to talk about things. But if we jump somewhere else in the universe of possible computations, there may be pockets of reducibility there, and we won't have words to describe those things; we only know about the things that are close to us, so to speak. Gradually, as science advances, we can expand the domain we can talk about: we can have more words, we can talk about more things. There's this gradual process where, in a sense, we socially learn more concepts, exchange those concepts, build on them, and so on. But if we get thrown somewhere else in what I call the ruliad, the space of all possible computational processes, if you throw us into an arbitrary place there, we will be completely confused, because there will be real computations going on, things happening, even pockets of reducibility, but we don't relate to those things. It's like imagining that you're here now, and you're cryonically frozen for 500 years, and you wake up again and there are all these other things in the world; it's hard to reorient yourself to all those other things without having seen the steps in between. And I think when you talk about where you can go from what we have now, how you can
add more intelligence: basically it has to do with these pockets of reducibility, these ways of moving forward, and what we think of as human intelligence is about those kinds of things. If you ask what's going to happen when the world is full of AIs, it's kind of interesting, because we've actually seen it before. When the world is full of AIs doing all these things, with all this computational irreducibility and all these pockets of reducibility that we don't have access to because we haven't incrementally reached that point, what's going to happen is that there are all these things going on between the AIs in a layer we don't understand. They're already happening in many places on the web, bidding on ads or choosing the content you're shown, whatever it is; there's an AI layer operating that we don't understand particularly well. But we have a very good model for this: it's what nature is.
Nature is full of things that happen that are often computationally irreducible and that we don't understand, but what we've been able to do is forge an existence that is coherent for us. Even though there is all this computational irreducibility, we have these little niches with respect to nature that are convenient for us as humans, so to speak. And I think it's pretty much the same with the world of AI: it becomes like the natural world, something not fully comprehensible to us. Our view has to be: "oh, that's just the operation of nature, that's something I'm not going to understand; oh, that's just how the AI works, I'm not going to understand it; but there's this piece we've actually managed to humanize and can understand." So that's a little of how I think that plays out.
In other words, you can say "I'm going to throw you somewhere random in the ruliad; incredible computations are happening there, it's great." I have spent much of my life studying that kind of thing, but to pull it back, to turn it into something that has direct human understanding, is a difficult thing. Aaron is asking more of a business question about Google and the Transformer architecture. Something very interesting is that neural networks were a small, fragmented field for many years, and then all of a sudden things came alive around 2012, and much of what worked, and was actually worked on, was done in a small number of large technology companies and some not-so-big tech companies. It's a different picture of where innovation happens than has existed in other fields, and it's kind of interesting, potentially a model of what's going to happen elsewhere. But it's always complicated what causes one group to do this and another group to do that, whether it's the entrepreneurial people who are smaller and more agile, or the people who have more resources, and so on; it's always complicated. OK, Nicola was asking whether I think pre-training a large biologically inspired language model might be feasible in the future.
I don't know. The issue is figuring out which parts of biology are important. One of the incredibly important things we just learned is that there probably isn't much more to the brain that really matters for its information processing than the neurons and their connections and so on. It might have been the case that every molecule has some quantum process going on and that's where the thinking actually happens, but it doesn't seem to be, because this pinnacle of our thinking powers, being able to write long essays and so on, apparently can be done with just a bunch of neurons with weights. Are other parts of the biology important? Actually, Terry Sejnowski just wrote an article talking about how there are more backward neural connections in the brain than ones that go forward, so in that sense maybe we're missing the point with these feed-forward networks, which is basically what something like ChatGPT is, and that feedback is really important; but we don't know yet.
We don't have the right idealized model of that yet. What's the next kind of McCulloch-Pitts model, the next simple metamodel of this? I think that's important too, because there is probably a lot of essential mathematical structure to learn there. You know, I was interested in neural networks back around 1980, and I was trying to simplify, simplify, simplify models of things, and I passed on neural networks because they weren't simple enough for me: they had all these different weights, all these different network architectures, and so on. I ended up studying cellular automata and generalizations of those, where everything is much simpler: there are no real numbers, no arbitrary connections, none of this, that and the other. But what matters and what doesn't, we just don't know yet. Paul is asking about a five-sense multimodal model, to connect the system to the real world with real experience similar to a human's.
I think that will be important, and it will definitely happen, and it will be more human-like. ChatGPT is quite human-like when it comes to text because, gee, it just read a large fraction of the text that we humans write, at least publicly; but it hasn't had the experience of going up the stairs and doing this or that thing, so it's not going to be very human-like when it comes to those kinds of things. If it has those experiences, then I think we'll get there, and that will be interesting. OK, someone's commenting that one should make the same kind of description for image-generation AI. What I like to think is that it's one of our first moments of communication with an alien intelligence. In other words, we're talking to the generative AI in English words or whatever, and it goes into its alien mind, so to speak, and pulls out these images and so on. With ChatGPT the result
is something that is already meant to be very human: it is human language. With an image-generation system, it's producing something that has to be recognizable to us, not a random collection of pixels, something that resonates with things we know; but in a sense it can be more completely creative in what it shows us. When you try to navigate the space of what it's going to show you, it feels very much like you're communicating with an alien intelligence, and it shows you things about how it thinks by saying "oh, you said those words, I'm going to do this," and so on. I mean, I have to say:
the other examples of alien intelligence we have are all over this planet; there are many, many creatures, so to speak. And I have to believe that if we could correlate the kinds of experiences of those critters, cats, dogs, cockatoos, whatever else, with the vocalizations they have and so on, we could get to the point of talking to the animals, so to speak. It feels like the kinds of things we learned from ChatGPT about the structure of human language
should carry over: I'm pretty sure that if there is any linguistic structure for other animals, it will be similar, because one of the lessons of biology is that there are fewer ideas than you think. These things we have had precursors in biology long ago; we may have made the innovation of language, it's kind of the key innovation of our species, but everything out there had precursors in other organisms. And the fact that we now have this much better way of discovering a model for language in humans means we should be able to do it in other places as well. OK, David is saying the ChatGPT developers seem committed to injecting some sort of political restrictions into the code, to prevent controversial topics from being talked about; how is that done?
It is done through this stage of reinforcement learning. I think maybe there's also something cruder: if the system starts to use certain words, just stop writing things. I think that's being done a little more with Bing than with ChatGPT at this point. I have to say that, as far as I know, ChatGPT is G-rated, and that's an achievement in itself; maybe I shouldn't say that, because somebody probably has a horrible counterexample. But one of the things that happens is: you have a group of humans giving the system this training, and those humans have opinions; they'll have this kind of politics or that kind of politics, they'll believe this or that or the other, and whether on purpose or not, they're going to impose those opinions, because there is no opinion-free way to do it. What you are doing when you tell ChatGPT "this essay is good, that essay is not good" is, on some level, an opinion. That opinion may or may not be colored by something about politics, but it's inevitable that you have it. I have to say, something I've thought about a little in relation to the general injection of AI into things we see in the world, like social media content and so on:
I tend to think the correct way to solve this is to say: OK, let's have several chatbots, trained on different criteria by different groups, under different banners, so to speak. You can choose the banner of the chatbot you want to use, and then you're happy because you're not seeing things that horrify you, and so on. You can debate whether you want to choose the chatbot that accepts the most diverse points of view, or whatever; that brings you back to standard questions of political philosophy and things like this. I think what you have to realize is that one wants to put ethics somehow into what is happening, but when one says "let's have the AIs do the ethics," that's kind of hopeless.
There is no mathematically definable perfect ethics. Ethics is about the way humans want things to be, and then you have to choose: is it the average ethics? The ethics that makes only five percent of people unhappy? These are old questions of political philosophy that, as far as we know, have no good answers. Once you launch into those questions, there is no "oh, we'll get a machine to do it and it'll be perfect"; that won't happen, because these are questions that have no machine solution, questions that in a sense come directly from us.
That's a thing to keep in mind about ChatGPT in general: ChatGPT is a mirror of us. It has taken what we write on the web, in aggregate, so to speak, and it reflects that back at us. To the extent that it does dumb things and says dumb things, some of that is really on us; it's pretty much the average of the web we're seeing. Tenacious is asking about a particular article, which sounds interesting, but I don't know it. Tragath is wondering how AI neural networks compare with other living multicellular intelligences: plant roots, nerve nets in things like jellyfish, biofilms, and so on. Well, one of the big things that has emerged from a lot of the science I've done is this thing I call the Principle of Computational Equivalence, which essentially says that as soon as you have a system that is not computationally trivial, it will ultimately be equivalent in its computational capabilities. That's important when we talk about computational irreducibility, because computational irreducibility arises when you have a system doing its computation and all other systems are simply equivalent in their computational sophistication: you can't expect a super-system to come along and say "oh, you went through all these computational steps, but I can jump ahead and get to the answer."
A really good question is what's characteristic of our consciousness in relation to all the computational irreducibility of the universe, and it's the fact that we have coherence. Consciousness seems to me to be a consequence of two things: point one, that we are computationally bounded, we are not able to look at all those molecules bouncing around, we just see various aggregate effects; and point two, that we believe we are persistent in time, that we have a persistent thread of existence through time. It turns out, and this is a big result of recent years for me, that the great facts of physics, general relativity (the theory of gravity), quantum mechanics, and statistical mechanics (the second law of thermodynamics, the law of entropy increase), the three great theories of physics that emerged in the 20th century, can all three be derived from the fact that we human observers, noticing those laws, have the two characteristics I just mentioned.
I think this is a beautiful, very important and profound result: we observe the physics we observe because we are observers of the kind we are. Now, an interesting question is: given that we are computationally bounded things, and the very fact that we observe physics the way we do is a consequence of those computational limitations, how similar are the computational limitations of these other kinds of systems? The fungus as an observer, if you will: how similar is that kind of observer to a human observer, in terms of the computational capabilities it has, and so on? I suppose it is quite similar, and in fact one of my next projects is something I call observer theory, a kind of general theory of the kinds of observers you can have of things. Maybe we'll learn something from that; it's a very interesting question. Dugan is commenting: can ChatGPT be improved using an automated fact-checking system, like an adversarial network? For example, could one basically train ChatGPT with Wolfram Alpha and improve it?
The answer is: surely that goes only so far, and then it will lose it, just as it does with matching parentheses. There's something about a network of that architecture: there is a certain set of things it can learn, but it cannot learn what is computationally irreducible. In other words, it can learn the common cases, but there will always be surprises, unexpected things you can only reach by explicitly doing the computation. Bob asks: can ChatGPT play a text-based adventure game? I bet so; I haven't seen anyone try that, but I bet it can.
OK, there's a question here about software: besides being trained on a huge corpus, what is it about GPT-3 that makes it so good at language? I tried to talk a little about the fact that there's regularity in language. I think the details of the Transformer architecture, the way it looks back at sequences and so on, have been helpful in refining the way you can train it, and that seems to be important.
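For the curious, here's a minimal sketch of the core "looking back at sequences" piece, scaled dot-product attention with a causal mask, using toy shapes and random values (my illustration, not GPT-3's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 5, 8                      # 5 tokens, 8-dimensional vectors
Q = rng.normal(size=(seq_len, d))      # queries: what each position is looking for
K = rng.normal(size=(seq_len, d))      # keys: what each position offers
V = rng.normal(size=(seq_len, d))      # values: the content actually passed along

scores = Q @ K.T / np.sqrt(d)          # how much each token attends to each other token

# Causal mask: a language model may only look back, never ahead.
scores[np.triu_indices(seq_len, k=1)] = -np.inf

weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over the past

output = weights @ V                   # each position: a weighted mix of earlier content
print(weights.round(2))                # lower-triangular: position i mixes tokens 0..i
```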
Let's see, Victoria's question: could feature impact scores help us better understand GPT? The idea is that when you run a neural network, you can tell more or less how much some particular feature affected the output the neural network gave. For ChatGPT it's really quite complicated. I started doing research trying to understand it the way a natural scientist would. I don't do any kind of neuroscience with real brains, because I'm a hundred, a thousand times too squeamish for that, but I can do research inside an artificial brain, and I started trying to do that, and it's hard. I didn't look at feature impact scores, but I think one could; one simple version is sketched below.
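A minimal sketch of one such score, occlusion, on a toy one-layer "network" (my illustration, not something actually run on ChatGPT): zero out one input feature and measure how much the output moves.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 1))  # weights of a toy one-layer "network"

def model(x: np.ndarray) -> float:
    """Sigmoid of a linear combination -- the simplest possible stand-in."""
    return 1.0 / (1.0 + np.exp(-(x @ W).item()))

x = np.array([0.5, -1.2, 0.3, 2.0])
baseline = model(x)

for i in range(len(x)):
    occluded = x.copy()
    occluded[i] = 0.0  # knock out feature i
    print(f"feature {i}: impact {baseline - model(occluded):+.3f}")
```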
By the way, I'm amused by these questions because, as best I can tell, they're not from bots. Let's see, Ron is asking about implications, like "I have to work late tonight": what does that mean? Yes, absolutely, ChatGPT is learning things like that, because as it turns out, a lot of text says "I have to work late tonight, so I can't do this"; examples of that have been seen. It's doing the Aristotle thing again, just seeing these language patterns, and that's what it's learning, so to speak. We might say, "how do we think about that formally? Oh, that seems complicated"; but to ChatGPT, that language pattern has simply occurred before. OK, the last one, maybe. Albert asks: do you think humans learn efficiently because they are born with the right networks to learn language more easily, or is there some difference? The architecture of the brain is certainly important. My impression is that it's a matter for neuroscientists to go and discover: now that we know certain things can be made to work with artificial neural networks, did the real brain also discover those things? The answer will often be yes.
Just as there are things we've probably learned from drone flying or airplane flying where we can go back and say "oh, biology actually already had that idea," there are undoubtedly features of human language that depend on aspects of the brain. For example, talking to Terry Sejnowski, we discussed the loop between the basal ganglia and the cortex, and the possibility that the outer loop of ChatGPT is a bit like that loop, going round and round.
One could say that maybe it really is data circulating in this literal loop from one part of the brain to another; maybe, maybe not, but sometimes those analogies have a habit of being truer than you think. Maybe the reason that when we think about things we have certain time frames, certain intervals between when the words come out and so on, is that those times are literally associated with how long signals take to propagate through a certain number of layers in our brain. If that's the case, there will be features of language that follow from having this brain architecture; and to the extent that language evolves, to the extent that it would be worth having a different form of language optimized for some different form of brain structure, that will be driven by natural selection and so on. There are aspects of language like the fact that we tend to remember chunks of about five things at once, and that if we're given a sentence with deeper and deeper subclauses, we lose it after some point; that's presumably a hardware limitation of our brains. OK, Dave is asking a good last question: how hard will it be for people to train something like a personal ChatGPT that learns to behave more and more like a clone of the user? I don't know; I'll try it.
I have a lot of training data; as I mentioned, something like 50 million words typed, words written by me. I know someone tried to train an older GPT-3 on stuff of mine. When I read ones trained for other people, I thought they were pretty decent; when I looked at the one trained for me, because I know myself better than anyone else, it didn't ring true, so to speak. But I think that's going to come.
Being able to write emails the way I write emails: it will do a decent job, I suspect. I would like to believe that one still, as a human being, has an advantage, because in a sense one knows what the objectives are. This system's goal is to complete the English text, and the big picture of what is happening will not be part of what it has, except to the extent that it learns the aggregate big picture just by reading a lot of text. But I think it'll be interesting.
As a person who receives a lot of emails, some of which are pretty easy to respond to, in principle, I hope maybe my bot will be able to answer the easiest things for me. Well, that's probably a good place to wrap this up. Thanks for joining me, and I'd like to say that, for those interested in more technical details, some of the people in our machine learning group are going to do some more detailed technical webinars on this material, really digging into how you would build these things from scratch, with more details on what's actually going on. But I should end here for now; thanks for joining me, and goodbye for now.
