
This is why Deep Learning is really weird.

Apr 25, 2024
Now, this book is called Understanding Deep Learning to distinguish it from other, more practical books that focus on things like coding. This book is about the ideas behind deep learning. After reading this book, you will be able to apply deep learning to novel situations where there is no recipe for success. The book starts right away by describing deep neural networks and takes you through how we train them, how we test them, and how we improve their performance. Then it moves on to different architectures: convolutional networks, residual networks, graph neural networks, Transformers. There's a long section on generative models (normalizing flows, VAEs, GANs, diffusion models) and a short section on reinforcement learning. At the end there are two chapters that I think are really interesting. There is a chapter on why deep neural networks work, where I try to interrogate a bit why we need this particular type of architecture, why it's easy to train, why it generalizes; we don't really have answers to those things, but I present some of the evidence that exists. And the final chapter is a chapter on ethics. I think the book will be useful if you don't know anything about deep learning.
It will take you from scratch to somewhere close to the cutting edge. If you are teaching deep learning, it will be an incredibly useful resource. It has 275 figures, most of which are new and represent things in different ways, and it also has a bunch of Python notebooks. If you are part of the rank and file of machine learning practitioners or researchers, it will fill in the gaps in your knowledge and maybe make you think about things in a different way. I think even my initial description of deep neural networks is a little different from how they are normally described, and I think you will learn a lot from it, because I am a member of the ranks of deep learning practitioners and I learned a lot from writing it, so I hope you learn something too. If your name is Geoff Hinton or Jürgen Schmidhuber, I can see it may not be that useful to you, but, well, you never know. So the title is a bit ironic, because at the time of writing no one understands how deep learning models work, literally no one. Now, deep learning models learn piecewise linear functions and, as you will know from our episode on spline theory,
deep learning divides the input space into many, many small regions; in fact, most models have more regions than there are atoms in the universe, and, frankly, it's a mystery, it's a damn mystery, how these models generalize and how they learn these functions. No one knows. So why does deep learning work? It is notable that the fitting algorithm does not get trapped in local minima or near saddle points, and that it can efficiently recruit additional model capacity to fit unexplained training data wherever it exists. Perhaps this success is less surprising when there are many more parameters than training data; however, it is debatable whether this is generally the case.
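Putting a number on "many, many small regions" is easy for a toy model. The sketch below (an illustration only, not code from the book) counts the distinct activation patterns, and hence linear regions, that a tiny randomly initialized ReLU network induces over a 2D input square; even this miniature model typically carves out hundreds of regions, and scaling up the width and dimensionality is what pushes the count past the number of atoms in the universe.

```python
# Minimal sketch: count the linear regions a small ReLU network induces on a
# 2D input square by recording which hidden units are active at each point.
# Every distinct on/off pattern corresponds to a different convex polytope,
# and within each polytope the network is a single affine function.
import numpy as np

rng = np.random.default_rng(0)

# A tiny MLP: 2 -> 20 -> 20 -> 1 with ReLU activations, random weights.
W1, b1 = rng.normal(size=(20, 2)), rng.normal(size=20)
W2, b2 = rng.normal(size=(20, 20)), rng.normal(size=20)

def activation_pattern(x):
    """Return the on/off pattern of all hidden units for input x."""
    h1 = W1 @ x + b1
    a1 = np.maximum(h1, 0.0)
    h2 = W2 @ a1 + b2
    return tuple((h1 > 0).tolist() + (h2 > 0).tolist())

# Sample a dense grid over [-2, 2]^2 and count distinct activation patterns.
grid = np.linspace(-2.0, 2.0, 300)
patterns = {activation_pattern(np.array([x, y])) for x in grid for y in grid}
print(f"distinct linear regions found on the grid: {len(patterns)}")
```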
AlexNet had 60 million parameters and was trained on 1 million data points; however, to complicate matters, each training example was augmented with 2,048 transformations. GPT-3 had 175 billion parameters and was trained on 300 billion tokens. There is no clear case that either of these models was over-parameterized, and yet they were trained successfully. In summary, it is amazing that we can fit deep networks so reliably and efficiently; whether it is the data, the models, the training algorithms or some combination of the three, something must have special properties that make this possible. The efficient fitting of deep learning models is surprising, and their generalization is astonishing. First, it is not obvious a priori that typical data sets are sufficient to characterize the input-output mapping. Second, deep networks describe very complicated functions. And third, generalization improves with more parameters. This excess of parameters gives the model the freedom to do almost anything between the training data points, and yet it behaves sensibly. It is not obvious that we should be able to fit deep networks or that they should generalize; a priori, deep learning should not work, and yet it does, and the success of deep learning is surprising.
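A quick back-of-the-envelope check of that point, using only the figures quoted above; whether augmented copies count as extra data points is exactly the ambiguity being described:

```python
# Back-of-the-envelope: parameters per training example, using the figures
# quoted above. Whether these models count as "over-parameterized" depends
# heavily on whether augmented copies are treated as extra data points.
alexnet_params = 60e6
alexnet_examples = 1e6
alexnet_augmentation = 2048          # augmentation factor quoted above

gpt3_params = 175e9
gpt3_tokens = 300e9

print(f"AlexNet, raw:       {alexnet_params / alexnet_examples:.1f} params per example")
print(f"AlexNet, augmented: {alexnet_params / (alexnet_examples * alexnet_augmentation):.3f} params per example")
print(f"GPT-3:              {gpt3_params / gpt3_tokens:.2f} params per token")
```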
In his book, Professor Prince analyzed the challenges of optimizing high-dimensional loss functions and argued that over-parameterization and the choice of activation function are the two most important factors that make this tractable in deep networks. He showed that during training the parameters move through a low-dimensional subspace towards a family of connected global minima, and that local minima do not seem to be a problem. So as we over-parameterize these models, generalization increases, but it is also related to other things, like the flatness of the minimum and inductive priors. It seems that both a large number of parameters and multiple network layers are required for good generalization, although we still don't know why. Many questions remain unanswered.
We currently do not have any prescriptive theory that allows us to predict the circumstances in which training and generalization will succeed or fail. We do not know the limits of learning in deep networks, or whether much more efficient models are possible. We do not know if there are parameters that would generalize better within the same model. The study of deep learning is often driven by empirical demonstrations, and Simon admits that they are undeniably impressive, but our understanding of the mechanisms of deep learning still does not match them. On ethics, Simon said it would be irresponsible to write this book without discussing the ethical implications of artificial intelligence.
This powerful technology will change the world in ways possibly not unlike electricity, the internal combustion engine, the transistor or the Internet. The potential benefits in healthcare, design, entertainment, transportation, education and almost every area of commerce are enormous; yet scientists and engineers are often unrealistically optimistic about the results of their work, and the potential for harm is just as great. Simon argues that everyone who studies, researches, or even writes books about artificial intelligence should consider the extent to which scientists are responsible for the uses of their technology. He said that we should consider that capitalism primarily drives the development of AI, and that legal advances and deployment for social good are likely to lag far behind. We should reflect on whether it is possible, as scientists and engineers, to monitor progress in this field and reduce the potential for harm, and we should consider what kind of organizations we are prepared to work for.
How serious are they in their commitment to reducing the potential harms of AI? Are they simply ethics-washing to reduce reputational risk, or do they actually implement mechanisms to stop ethically suspect projects? Simon invites readers of his book to investigate these issues further. It is undeniable that artificial intelligence will radically change society, for better or worse. However, optimistic visions of a future utopian society powered by AI must be approached with caution and a healthy dose of critical reflection. Many of the touted benefits of AI are beneficial only in certain contexts and only to a subset of society. The book cites Green (2009), who highlighted that a project developed using AI to improve police accountability and alternatives to incarceration, and another developed to increase security through predictive policing, are both advertised as "AI for social good", in big air quotes. Assigning this
label is a value judgment that lacks grounded principles: one community's good is another community's harm. Ethical AI is a collective-action problem, and the chapter concludes with a call for scientists to consider the moral and ethical implications of their work. Every ethical issue is beyond the control of any individual computer scientist; however, this does not imply that researchers have no responsibility to consider and mitigate, where they can, the possible misuse of the systems they create. We dive in, you know, actually doing things, and again, maybe I'm just being a human chauvinist, but I'm all for it being integrated into our cognitive ecosystem. But hey, AutoGPT doesn't do things very well. Well, it does, although, I don't know, I mean, yeah, in the way a cat flap does things: you can make it do things, it can execute code and so on, but I wouldn't say it has agency.
Hi everyone, I'm Tim from your go-to channel and podcast for everything machine learning, AI and philosophy. Today I'm reaching out with a special request. As you know, creating content for MLST requires a considerable amount of time and resources; it's a labor of love that I do solely for the fun and passion of the topic. But to continue providing you with the high-quality content you expect, I need your support, so please consider supporting us on Patreon. Every little bit helps, even if you can only donate a small amount, and if you can't afford it, just let me know and I'll give you free access to the Patreon benefits, no questions asked. Anyway, thank you so much for your time and I can't wait to welcome you to our Patreon family. Signing out for now, regards. Simon, it's an absolute
honor to meet you. Welcome to Machine Learning Street Talk. I'm very happy to be here. So tell us about yourself. Well, I actually started my career in psychology, my PhD is in psychology, and then I've been wandering around various parts of science. I worked in neuroscience for a while, I did some early work in augmented reality, I dabbled in medical imaging over the years. I was a professor at UCL and worked in computer vision, and I'm probably best known for a book I wrote a while back. For the last decade I've been working primarily in industry, in finance and computer graphics, and I'm currently a professor at the University of Bath, where I've been working on a new book, Understanding Deep Learning, which will be published by MIT Press; sorry, I should say which is published, that's fine. Simon, we were joking before that, in your last book, you were right in the middle of writing a book on computer vision and probabilistic graphical models, and then along came that
Sutskever guy, and we'll get back to him later. Well, it wasn't him, it was actually Krizhevsky, but they were all basically Hinton's guys, and they launched AlexNet, right? And then that was computer vision completely figured out. So my last book was a really ambitious attempt to basically reshape all of computer vision as I saw it, formulating what was a fairly ad hoc selection of methods in terms of probabilistic graphical models, which wasn't necessarily how everyone thought about it at the time. And in 2010 I went on a sabbatical at the University of Toronto and worked on this book, probably with Alex Krizhevsky in the next room, sharing an office with Geoff Hinton's postdoc.
I worked tenaciously to get this book published in 2012, a few months before AlexNet came out and the whole field took a right-angle turn, leaving my book in the dust, although I still think it's pretty useful; the geometry and stuff definitely still holds, and it has a lot of material on basic probability. But this book is less ambitious; it's a simpler description of where we are with deep learning. It's supposed to be the spiritual successor to Goodfellow, Bengio and Courville, oh yeah, which was published in 2016, so obviously a lot has happened since then. So you need a kind of pragmatic middle ground between very theoretical things with a lot of proofs and very practical things with a lot of code: no proofs, no code. It's about the ideas that drive deep learning, okay?
So, Simon, when you started writing this book, what was the main idea you had in mind? Well, I think the story of deep learning is that the experimentalists have gotten way ahead of the theory, and now we have this explosion of papers where there's literally an exponential increase in the number of articles being published, and when I say literally I mean literally: there is a plot in a paper that came out last year where, on a semi-log scale, it is a straight line, with 4,000 papers submitted to arXiv each month, and presumably now there's more than that. Obviously it can't increase exponentially forever; there's a finite number of humans on the planet and not all of them can do machine learning research. So there's a staggering amount of information available, and if you're a new person coming to machine learning, or you want to learn something new, it's almost impossible to find good resources; people are learning things from hastily written blogs by people who don't always know what they're talking about. So it seemed like something really useful I could do for the community, where I could write down all the most important things that have happened in the last 10 years connected to deep learning, with the same notation,
illustrated in a modern way, without regard to history. I don't start at the perceptron, I jump right into deep neural networks at page 20 of the book, not page 160, just to collectively save the community a lot of time. Yeah, and do you think deep learning is alchemy? No, it's not alchemy at all. I think in the future what we will consider it to be is the science of modeling functions and probability distributions in very high dimensions. I mean, I think it'll be reframed in terms of science; right now the way we organize our whole community is all about results, so we don't really talk about it that way, but I think within 40 years people will look back and say, well, in the 2010s they studied how to construct functions and how to model probability distributions in dimensions higher than, say, 50. Yeah, so I was being ironic with the alchemy point, but I guess, number one, people have made some weird analogies with neuroscience and biology. Even the word neuron is pretty interesting: it only appears four times in your entire book, and on two of those occasions you advise us not to use the word neuron, so let's start
with that. Yes, and I have to say that I will probably accidentally use the word neuron during this conversation because it is so ingrained in our community, but I think it is a terrible analogy. There is no evidence that the brain works anything like deep neural networks. If you look at the kind of epiphenomena of, you know, human computing, it seems like we have things like short-term memory, it seems like we need to dream to consolidate memories, we have a modular brain with special parts for recognizing faces and navigating the world and so on, and there is no evidence that deep learning has any of that. Likewise, there is no evidence that the brain has any of the epiphenomena of deep learning: there's no evidence of double descent or adversarial examples or lottery tickets in the brain, as far as I know. And this is fine for our community, because we know what we're talking about, but now deep learning is becoming something really important in the real world, and we're trying to communicate this to the general public, and we're talking about neurons and neural networks, and that carries a lot of baggage. You know, an interesting experiment that everyone watching this should do is to go and talk to someone at a dinner party who doesn't know very much about AI; you meet a lawyer, someone smart who works in a completely different field, and ask them what their understanding of current AI is, and the answer you will almost certainly get is that they have no idea. They might give you a couple of buzzwords, but how does it work?
They have no idea. And I really disapprove of the neural metaphor just because it comes with a lot of baggage; it implies maybe that the network is having thoughts, or that it's something like us, and that's deeply misleading to people outside of our community. Of course, everything we're doing increasingly affects those people outside of our community, and we want to give them sensible information about what we're working on. Yeah, so I mean, again, the bottom line is that we are now dealing with multiple levels of emergence, and what I mean by this is that people understand gradient optimization, they understand parameterized models, they don't understand emergent phenomena, and they're looking for analogies, let's say psychological analogies; you know, they talk about things like theory of mind. I'm not entirely convinced that there are emergent phenomena, depending on exactly how emergent phenomena are defined.
For me, an emergent phenomenon would be something where you gradually increase the scale and suddenly there is a phase change where you can see new phenomena; I guess, if I want a better word, something suddenly appears with scale, and I'm not sure we've done those experiments very thoroughly. What you're really telling me is that the statistics of data on the Internet are surprisingly rich, that we can know how to complete sentences or translate from one language to another. All of that simply reflects the statistics that are on the Internet, and it is surprising that when you gather so much information, put it into a network, and reproduce those statistics, we see these phenomena, but I don't see that as an emergent property of the network.
I see that property as a property of the data we are putting on the web, whether it comes from the richness of the statistics or not. I think we are looking for mental frames of reference to understand such phenomena; it's just that we need a way to understand this properly, and psychology has a literature on theory of mind that we can use, and I'm almost certainly overloading it and bastardizing it, but you say we shouldn't do that. I have the most reductionist possible view of this. When I see a large language model, I see a huge equation with billions of terms in it, and we have set the parameters of those terms so that it performs some kind of behavior, and I don't think there's any great meaning in that behavior. It's an equation: there are inputs that you compute, you add things, you multiply things, and occasionally, if it's a transformer, you take an exponential.
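To make "it's just an equation" concrete, here is a single self-attention head written out in NumPy with arbitrary toy sizes; this is only a sketch, not any particular model, but it shows that the computation is nothing more than multiplications, additions and one exponential inside the softmax.

```python
# A single self-attention head, written out as plain arithmetic: multiplies,
# additions, and one exponential (inside the softmax). Toy sizes throughout.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))          # token embeddings
Wq = rng.normal(size=(d_model, d_head))          # learned projections
Wk = rng.normal(size=(d_model, d_head))
Wv = rng.normal(size=(d_model, d_head))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # multiply and add
scores = Q @ K.T / np.sqrt(d_head)               # more multiplies and adds

# softmax: the one exponential in the whole computation
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

output = weights @ V                             # weighted sum of values
print(output.shape)                              # (5, 8)
```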
Yeah, um, and some numbers come out on the other end that you then translate into words and I don't think it can be understood on any deeper level than that, but couldn't you then argue that there's nothing special about it? our mental states or our behavior because we are also at some level just performing simple calculations. I think you could argue that way, but there's more at play because we have a larger variety of brain systems that interact with each other, some of which have been built on top of each other during evolution, so there's more structure in the system. natural model, you know, it's not at least the human brain is not an equation, it's a bunch of equations that interact with each other in a complicated way, so, if that maps. to the kind of constructs you're talking about, I don't know, but I see the human brain as something quite different, it's mainly a matter of complexity, so when you have this rich kind of functional dynamics of things interacting in the world Physically you have the emergence of agency and all kinds of moral status, etc., and you're basically making the claim that in neural networks we're nowhere near that kind of emergence.
I mean, you're asking me questions that no one knows the answer to, because the only existing model we have that works is the human mind, and that seems to work on quite different principles than just scaling. Yes, I think there is something very interesting about Transformers, though. I was arguing before that I don't like this neural analogy, but large language models like ChatGPT are the closest thing we have to something like the human brain, because they don't, as I put it, just map an input to an output; they also have this kind of short-term memory, which is the context window, in a sense, and it's ironic that we don't refer to that in terms of a neural analogy, because to me it's the first thing we've built that looks something like the human mind. You have this ongoing context and it makes the next-token prediction based on that context, and in principle you could then operate on the previous context, and the system itself could operate on the previous context: summarize it, file things away, ask itself to generate other hypotheses, you know, different hypotheses to explain something, compare them and decide on something, and use it as a kind of scratch memory in the same way that we have a working memory. But strangely we don't frame that in terms of the neural analogy, which I find quite ironic.
I don't know if people are working on that kind of stuff. I assume people are working on all of it, but I haven't personally read any work where the Transformer system goes back and edits things in its past context. But I guess that would be a direction you could take to try to get this system to do something that's more like thinking. I mean, in the end, a purely feedforward system can't really do anything sophisticated; it probably needs to do some kind of manipulation, if only to generate internal consistency, because there is no way a large language model can have internal consistency.
It learns everything on the Internet. It thinks that the Earth is both flat and round, in different statistical proportions; you know, hopefully most people on the Internet think it's round, that's the conclusion we've come to, but in the weights somewhere there is flat-Earthness. So to get to another level of cognition you're going to need something that builds an internally consistent model of what's out there. Whether that requires, as one could argue, interaction with the real world, or whether it can be done purely in the domain of language, remains to be seen, but I think that could be one direction people take things. You know, the body of knowledge of humans is a kind of virtual phenomenon that happens to all of us physical earthlings, so this infosphere that we have created is like a symbiotic organism and it has consistent artifacts of knowledge, like you said, but many humans hold the view that the Earth is flat, so it's just another example of this interesting kind of levels of emergence. But they hold an internally consistent view that the world is flat.
I mean, as far as they're concerned, it's internally consistent. Obviously, there are inconsistencies that are proven pretty easily, but in their mind they have a model of the world by which they explain everything: well, if the Earth is flat, obviously the sun must be flat too, and the moon must also be flat, and you know, when you look at the horizon of the sea it looks flat and, consequently, the Earth cannot be round. They explain other phenomena and build a model to support the hypothesis, and there is no sense in which the Transformer system is doing anything like that; it just starts from the beginning and predicts the next word, and it has statistics that are consistent with whatever is previously in its context window.
Yeah, you could argue that humans, our brains, are also very chaotic, but we have this post-hoc confabulation and rationalization in the same way. So we know that unconsciously we hold contradictory views, but when we try to explain our views and avoid cognitive dissonance, we try to reduce what we think to something simple. True, but we have a finite number of points of view that are sort of partially rounded theories of the world. The large language model has everything humans ever created, with no preference for one thing or another other than its statistical probability.
Yeah, that's right, even if you have multiple conflicting views of the world. You know there's that famous poem by Walt Whitman where he says, "Do I contradict myself? Very well then, I contradict myself; I am large, I contain multitudes." You know, that's what human beings are like, but we don't have all views of everything simultaneously; we are trying, on some level, to create consistent models of the world, and we need to do that because we need to take actions in the world, and it's impossible to do that if you have 50,000 conflicting views on how things work. And this is really interesting, because Hinton says that one of the reasons GPT is a kind of superintelligence is because it knows all the things, but I would say I think we are a little limited as observers. There is some sort of computational restriction on how many things a single observer can understand at once, and we'll talk more about this later, but I think with cognition it's not just about knowing; it's also about thinking, so knowing everything isn't actually the whole piece. Yeah, can you deduce new facts? I think on one of your other podcasts you talked about whether, if you trained ChatGPT on data only up to the beginning of the 20th century, it would be able to reproduce Einstein's theory of relativity. I think we all know the answer to that; it definitely wouldn't. And one of the missing pieces is what I was saying before:
it is true that to build that theory you need to have a model of the world, and you must realize that that model of the world is incorrect, that certain facts (personally I don't think you need to observe those facts yourself) are inconsistent with that theory, and then you need to somehow come up with a new model that itself makes new predictions about the world that people can go and test, in the case of physics. But I think that happens on a smaller scale in your own life: you have theories of how companies work, or how your friends' personalities work and what the best way to interact with them is; you know, you have theories about everything, and they occasionally break down and you have to radically rethink them. From a computational point of view,
we are finite state automata. I mean, you're saying that any finite entity can only compute certain things, by the nature of the fact that it is finite, and, put eloquently, of course that's true, but I'm not sure it's a radical idea. I mean, what would be interesting is if we had some way to characterize what kinds of things could be computed with a certain number of parameters, or whatever, but as far as I know, we can't completely describe the function space between input and output given a fixed set of parameters and a certain neural network architecture, or what lies outside of it, because you're building this very complicated surface in a multidimensional space and everything depends on everything else, so it's not really obvious. But it looks like it's very rich, in the sense that we can give it almost any data set we want and, with enough capacity, it can fit it well.
I think we're about to put the pin right in the middle of the bullseye here, which is, and we'll talk about universal function approximation later, that given an infinite number of neurons you can approximate a function to arbitrary precision. That's a bit like saying that if I have a hard drive of infinite size, I can store any file I want. So the big argument between the connectionists and the symbolic people is that the symbolic people argue that we need an infinite amount of computation for many things; neural networks have told us that in many cases, no, we don't, and then I guess the question is, well, what?
Are there edge cases where we need an infinite amount of computation? I mean, it's an interesting question that you're asking in different ways: are there ideas that the human mind can never understand or, conversely, things that we can understand that a neural network can't? Um, Simon, now we've teleported to our studio; we were outside freezing to death, and it's a lot warmer and less muddy in your studio, actually, that's right. So, Simon, we'll cut straight to the chase. I've been reading your amazing new book on deep learning and I think we should start by talking about the first few chapters, particularly chapters three and four, where you talk about not just single-layer neural networks but also deep neural networks, and there is this kind of elephant in the room in general:
the way I feel about neural networks is, why did they work so well? You know, the unreasonable effectiveness of neural networks: why do they work so well? Well, the good news is that they work really well, if you've been asleep for the last 11 years or so; they're incredibly effective, but it's a bit of a mystery, and I think it's particularly interesting to look at it from the perspective of just before AlexNet came along. ImageNet at that time would have been considered a real stretch goal. Richard Szeliski wrote in the computer vision book he published at the time that he expected it would be years and years before computers could see as well as a two- or three-year-old. ImageNet is a challenging task: the input images are 224 by 224, which gives you about a 150,000-dimensional space, so, roughly, something like 10 to the 150,000 possible images. You could say, okay, most of those images are noise, but presumably the actual variety of images is still extremely large, and you have to build a model that maps this to one of a thousand classes, and you only have a million examples in this 10-to-the-150,000 space, and for each of those classes you only have a thousand examples, so you can't even construct a Gaussian for each class. In that case, if you didn't know that humans do this task, you might just give up, but because we knew that humans could do this task, people persevered, and AlexNet set out on this extremely ambitious program that was completely different from what most people were doing at the time.
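The arithmetic behind that argument is easy to reproduce; the Gaussian parameter count below is just the standard mean-plus-full-covariance count, added here purely for illustration.

```python
# Quick arithmetic behind the ImageNet argument above: the input dimension,
# examples per class, and how many parameters a full-covariance Gaussian per
# class would need (vastly more than the available data).
d = 224 * 224 * 3                     # input dimensionality
n_train = 1_000_000                   # training images
n_classes = 1_000

examples_per_class = n_train // n_classes
gaussian_params_per_class = d + d * (d + 1) // 2   # mean + full covariance

print(f"input dimension:                 {d:,}")
print(f"examples per class:              {examples_per_class:,}")
print(f"params for one Gaussian / class: {gaussian_params_per_class:,}")
```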
I know there are possibly some predecessors; I don't want to get into that, I'll talk about AlexNet and you can judge for yourself whether that's the right thing to talk about. So AlexNet set out to build a neural network with 60 million parameters, several orders of magnitude larger than most computer vision models that would have been built at that time, and the way I think about neural networks is that they divide the input space into convex polytopes, each of which has a linear function, or more properly an affine function, inside it. So if the input space is two-dimensional, then it divides the input space into convex polygons, and over each polygon there is some kind of plane, and those planes are arranged so that they form a continuous surface. But now we have 150,000 input dimensions, so you're in very high dimensions, and our model creates this incredibly complicated surface, and it's difficult to count the number of polytopes it creates, but just the fully connected layers at the end would give you on the order of, you know, 10 to the power of 4,000. So you have this huge space, and you have a model with 60 million parameters that creates a number of regions much larger than the number of atoms in the universe.
And, you know, almost none of those regions will ever see a training data point or a test data point at any point during training, and there's a super complicated relationship where you modify one of your 60 million parameters and this much larger number of polytopes changes and shifts around in some very complicated, indirect way that is difficult to characterize. So now you come along and say, okay, I'm just going to do a 60-million-dimensional optimization problem. At that point in computer vision, people were considered ambitious if they were doing thousand-dimensional optimization problems; usually the biggest of
these types of problems would have been a structure-from-motion problem, where you would first try to find an approximate solution close to the answer and then use nonlinear optimization; you're already somewhere near the global minimum, and the optimization just gets you to the final place. But they start by randomly initializing this network and then use the dumbest algorithm, basically noisy gradient descent, to get to the bottom, because you can't use anything else: everything else uses second-order information, and the number of parameters is far too huge. It sounds like a disaster, it sounds like a disaster.
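For anyone who hasn't seen it written down, "noisy gradient descent" here is nothing more than first-order minibatch SGD; below is a minimal sketch on a toy least-squares problem, with no curvature information anywhere.

```python
# Minimal "noisy gradient descent": plain minibatch SGD on a toy least-squares
# problem. Only first-order gradients are used, never curvature, which is why
# the same recipe scales to models with tens of millions of parameters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # toy data
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = rng.normal(size=20)                              # random initialization
lr, batch_size = 0.01, 32

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # random minibatch -> noisy gradient
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size     # gradient of mean squared error
    w -= lr * grad                                   # first-order update only

print(f"error in recovered weights: {np.linalg.norm(w - w_true):.3f}")
```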
I don't think many people would have predicted that you could learn the model, but one thing they had going for them, and we can come back to this in a moment, is that there are many more parameters in the model than there are data points: it's over-parameterized, and maybe that makes the fitting easier, and there's a subtlety there that we can come back to. Yeah, okay, so they fit the model, but like I said, there are 60 times more parameters than there are data points, so the surface they fit basically goes through every data point exactly, but for every data point it passes through, it can do 60 other things; it has 60 more degrees of freedom.
So between those data points, which, as we've discussed, is almost the entirety of the space, it's free to do whatever it wants. They did actually apply some regularization and some dropout, but we subsequently found out that those aren't really critical for this; at the time you might have explained it with that, but since then we know that we can learn models without dropout and without regularization and they still generalize reasonably well. Yes, and somehow this model performed 15% better than the next best model, or whatever it was. It was pretty amazing, I mean, really amazing. Yeah, well, let's make a point of order on that. So, first of all, I'm a big fan of this view of neural networks that you just talked about, which essentially divides the ambient space into these locally affine polyhedra, and Randall Balestriero came up with this spline theory of neural networks, and people should watch our earlier show where we talked about that in detail. What he basically said was that for a given input example, the neural network can be represented with a single affine transformation, which surprises many people for two reasons: first, people think that neural networks are nonlinear, and second, it gives you this beautiful intuition that neural networks are a collection of storage buckets, not unlike a locality-sensitive hash table, and I think that's very instructive. So, I wrote this book blissfully unaware of that theory, I should say, but that's exactly how I describe it, in this kind of
simple, constructive way: how do we compose these networks? We should say that this is only true for piecewise linear activations like ReLU; if you have smoother activation functions then you can't characterize it this way anymore, but I actually like to discuss it in terms of these regions, because then the number of output regions gives you an idea of the complexity of the output surface, whereas once you start talking about smooth functions it is much more difficult to characterize them. But it is still something like locally smooth, because, I suppose, for people at home, each of these neurons, and maybe we'll come back to that word, is like a kind of hyperplane, so you train the network and you move all these hyperplanes around in the ambient space, and essentially when you put an input in, it activates some of those hyperplanes and creates this kind of convex region
depending on which hidden units are active or not. Yes, and although these are piecewise linear functions, there are so many of them, densely packed together in space, that they appear locally smooth. Presumably we can plot one-dimensional slices through the function and see what it looks like, and of course it looks smooth due to the sheer number of them. Right, but then we get to the next interesting part, which is, you know, you talked about over-parameterization, and, like a finger in the air, let's say there's an exponential number of these convex polyhedra, something like, I think you said, two raised to the power of the number of dimensions, and, well, generally it would be more than that.
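Here is a minimal sketch of that local-affine picture (a toy illustration, not code from the book's notebooks): for a fixed input, read off which ReLUs are active, collapse the whole network into a single affine map y = Ax + c, and check that it reproduces the network exactly at that point and at nearby points inside the same polytope.

```python
# For a fixed input, a ReLU network *is* a single affine function y = A x + c.
# This sketch reads off the activation pattern, builds that affine map by
# zeroing out the rows of inactive units, and checks it matches the network.
import numpy as np

rng = np.random.default_rng(1)

# Tiny MLP: 3 -> 10 -> 10 -> 2 with ReLU activations.
W1, b1 = rng.normal(size=(10, 3)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 10)), rng.normal(size=10)
W3, b3 = rng.normal(size=(2, 10)), rng.normal(size=2)

def forward(x):
    a1 = np.maximum(W1 @ x + b1, 0.0)
    a2 = np.maximum(W2 @ a1 + b2, 0.0)
    return W3 @ a2 + b3

def local_affine(x):
    """Collapse the network to y = A @ x + c, valid inside x's polytope."""
    d1 = (W1 @ x + b1 > 0).astype(float)            # which first-layer units are on
    a1 = d1 * (W1 @ x + b1)
    d2 = (W2 @ a1 + b2 > 0).astype(float)           # which second-layer units are on
    A = W3 @ (d2[:, None] * W2) @ (d1[:, None] * W1)
    c = W3 @ (d2 * (W2 @ (d1 * b1) + b2)) + b3
    return A, c

x = rng.normal(size=3)
A, c = local_affine(x)
print(np.allclose(forward(x), A @ x + c))            # True: exact at x

x_near = x + 1e-4 * rng.normal(size=3)               # tiny step, same polytope
print(np.allclose(forward(x_near), A @ x_near + c))  # True (with high probability)
```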
I was trying to come up with an absolute lower bound that no one could disagree with, but this gets fascinating. Traditionally, before neural networks, the world was different and we used to talk about the curse of dimensionality, which is basically the statement that there is an exponential relationship between the volume of space and the number of dimensions. So, yeah, I would say it's the tendency of the volume of space to completely eclipse the amount of data you have as the number of dimensions increases. Exactly, so when the volume of space increases exponentially, the statistical significance of your training data tends towards zero, so there is no statistical information, which begs the question: if there is no statistical information, how do networks work? And now we are getting to the difficult part: why does it generalize at all?
Yes, it is clear that you can pass through each data point, but what you do between the data points is, in some ways, almost a byproduct of our algorithms. It does a smooth interpolation; why it's a good smooth interpolation, we don't know, and why it's even smooth, we don't know. It seems to be some complicated byproduct of the way we initialize neural networks, the noisy algorithms we use to train them, and the over-parameterization itself. Maybe the kind of thing that could be happening is that we end up with smooth networks that interpolate just because of the over-parameterization: when we set up our networks we initialize them, and the wider they are, the smaller the numbers we put in, and the numbers start small and they stay small because there's never any push to make them larger, and because the numbers in the network have small magnitudes, that basically maps to small slopes.
I want to go back to the over-parameterization a little bit too, because it's not totally obvious to me that everything is clearly over-parameterized. While AlexNet had 60 million parameters and 1 million training points, they also augmented their data by a factor of 2,048, and if you look at all the papers on this ImageNet classification problem, the best results of the last 10 years are all over-parameterized by 10 to 100 times, but they also augment the data, which muddies the water. So one way to think about it is that these are more data points and we are not over-parameterized, but of course they are no longer independent data points; they are translations, rotations, colour transformations of the input image, so I don't really know where we are with over-parameterization. Yeah, well, I want to get to the inductive priors in a minute, because I think these local affine polyhedra are key, and then, if you make any semantically equivalent transformation such as a translation or a rotation, as far as an MLP is concerned, it's a different thing, so these inductive priors basically photocopy the information so you put the same thing into different buckets, so it's like cheating. Maybe we'll get to that in a second, but just to close the loop on this generalization thing: what we are doing is making the neural network more complex, and this is really strange, because we were taught in school that Occam's razor says that simple things generalize, and now we are making networks exponentially more complex. Why do they still generalize?
Well, it's interesting how the theory can't really cope with that. Earlier ideas, a theory like Rademacher complexity, would have predicted the opposite: it would have predicted that as we added more parameters, the generalization would generally have gotten worse. I'm talking about the kind of vague idea people have when they talk about having twice as many parameters: are there more regions? It just means that you can model a smoother function; it has to go through the data points and it's smooth elsewhere, and we don't have a much better conception than that. You know, for things like images we can incorporate some prior knowledge about images, that the statistics are the same everywhere; we use convolutional networks, so we are searching through a more sensible subset of models.
You know, the convolutional network is a strict subset of a fully connected network, so we're looking through a more sensible set of models that have some prior knowledge that we added, but that still doesn't explain to me in a satisfying way why networks generalize so well. Can we bring in this notion of the manifold hypothesis? How does that fit in? Because even before, you were saying something really interesting, which is that everything is an inductive prior; you know, even the world we live in is an inductive prior, but let's remember to keep the philosophy out of the discussion. Data augmentation is definitely an inductive prior, so you can augment the data and effectively upsample in the vicinity of what you know, or, if you know certain types of transformation, you can tell the neural network that
the transformation exists, by augmenting the data. Yes, well, there are two ways: you can incorporate it into the network, to make it equivariant or invariant depending on your needs, or you can just do it with data. Yes, true: you transform the output the same way you transform the input if you want it to be equivariant, or you ignore the transformations if you want it to be invariant to that type of transformation. Yes, and there is an interesting relationship here; we have this bias-variance tradeoff, and we're doing statistics here, so in an ideal world you would only need to give it one labeled example of what you want it to learn, because the network would understand all the transformations that could happen, and it doesn't work like that. So there's kind of a middle ground: on the one hand we have this dumb MLP that doesn't know anything, and we're introducing all this, what would you call it, knowledge of physics, knowledge of how things can be transformed? Domain knowledge seems like a good term. Yeah, yeah, okay, interesting. So we were talking about this manifold hypothesis, the idea that there's some kind of subspace in the data. Because theoretically, as you were saying before, according to the curse of dimensionality it should be physically impossible, since you would need more data points than atoms in the universe to train these models, people cite the manifold hypothesis, that is, there is some kind of subspace in the data where the network can focus its attention. What do you think about it? I mean, in some cases there's a very simple case where you can almost describe that subspace, which is if you take someone's face and you film it under fixed lighting from a fixed viewpoint and they don't move much; then they have 42 muscles in their face and only 42 things can happen, and that's kind of a physical description. Of course, the real world is more complicated than that, but there are still regularities.
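To make the equivariance and invariance options above concrete, here is a toy sketch of the two choices (hypothetical arrays, not a real pipeline): for invariance you transform the input and keep the label; for equivariance you transform the input and the target together.

```python
# Encoding knowledge of a transformation by augmentation (toy sketch).
# Invariance:   transform the input, keep the label the same.
# Equivariance: transform the input and the target in the same way.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # toy image
label = 3                             # hypothetical class label
mask = (image > 0.5).astype(int)      # toy per-pixel target (segmentation)

augmented = []
for k in range(4):                    # all four 90-degree rotations
    rot_image = np.rot90(image, k)
    # classification should be invariant: the label is unchanged
    augmented.append((rot_image, label))
    # segmentation should be equivariant: rotate the target the same way
    rot_mask = np.rot90(mask, k)
    assert rot_mask.shape == rot_image.shape

print(f"{len(augmented)} augmented (image, label) pairs from one example")
```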
There are only certain types of materials, there are only certain types of lighting; real images are not all possible images. You know, pick random pixels and see how many times you have to do it before you get something that doesn't look like noise: you'll give up very quickly. We don't really know the size of that manifold, although you can estimate it, because it's related to the degree to which you can compress images. So one thing that's really interesting to me about these new diffusion models is that they're not that big, you can put them on your hard drive, but they still seem to be able to generate this incredible variety of images. We don't know what images they can't generate, of course, but a good
part of the manifold of images can apparently fit on a hard drive, suggesting that the space of natural images is actually not that big. Yes, it's interesting. So you're saying that the apparent creative and generative capacity of these models suggests that there is a low-dimensional manifold of images? Yeah, well, there must be, because the generative model doesn't have 10 to the 150,000 parameters, for the very good reason that we can't store even a small fraction of that. Very interesting. Okay, so we should talk quickly about universal function approximation. If we start, first of all, with a shallow network, this universal function approximation theorem basically says, you know, that you can approximate,
with arbitrary precision, any function, and it's just a collection of basis functions? You know, we're just optimizing the placement of these basis functions. I mean, you can still think about it as simply dividing the input into convex polytopes, that's fine too, or you can think of it in terms of basis functions. I've always thought that this universal function approximation theorem is not particularly useful. I mean, it's important for us to know that you can represent, with some caveats, any function you want. What's really interesting is that to build anything that works really well, it seems like we need 10 to 12 layers, and it's hard to know why; there are quite a few different theories as to why we would need a deep network, given that the universal approximation theorem says a shallow network will do, it can model anything.
I mean, right, so the theorem talks about a very wide single-layer neural network, and it seems to me that just laying down basis functions in a single layer of a neural network is almost the antithesis of generalization; it's memorization by definition. So you make this really interesting argument in your book that when you introduce depth into a neural network, what you're doing is progressively folding those affine regions back on themselves. Right, yeah, that's a partial way of thinking about it. So a partial way of thinking about it is that if you have a two-layer network, you can think of the first network as folding and replicating the second network in a complicated way, and yes, go look at the picture in my textbook; it's difficult to describe verbally. But that's only a partial way of looking at it; there are other ways. A different way of looking at it is that you're creating, I think what you would call, increasingly complicated basis functions and then you're chopping them up and recombining them to create even more complicated functions. So one way emphasizes this symmetry and this folding that's going on, and another way emphasizes this kind of clipping: it creates more joints; in one dimension they would just be joints in the function, in two dimensions they would be kind of one-dimensional folds in the function. That's a different partial way of looking at it, but I don't think it's really possible for anyone, even for a two-layer network, to have a full picture of exactly what it's doing. But what's definitely true is that there's a really complicated relationship now between manipulating any of those parameters and the whole surface that comes out at the other end. Yeah, and when we made our first program on spline theory, we were showing this neural network visualizer, and you can add layers and change the activation functions and it shows you the activation space, and very, very quickly these very complex behaviors arise, where, you know, the hyperplanes cancel each other out. And what I want to get at is this topological interpretation, that is, as you go through the neural network in successive layers, it can recompose all the prior neurons in previous layers that are topologically addressable, and this is a DAG that we are working with, so obviously it is a subset of those, and there is something really interesting: the neural network has a structure that is defined by the first layers, so as you go through the neural network, it is somewhat limited by what happened before. I mean, it is limited in terms of information.
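A minimal 1D illustration of that folding-and-composition idea, in the spirit of the figures in the book (a toy sketch only): count the linear pieces produced by one shallow ReLU layer versus the composition of two such layers.

```python
# Depth multiplies linear regions: a 1D sketch. One hidden ReLU layer with 5
# units gives at most 6 linear pieces; composing a second layer on top can
# replicate kinks inside every piece, so the count grows multiplicatively.
import numpy as np

rng = np.random.default_rng(2)

def layer(x, W, b, v):
    """One hidden ReLU layer, scalar input -> scalar output.
    Returns the output and the on/off pattern of its hidden units."""
    pre = W * x + b
    return float(np.maximum(pre, 0.0) @ v), tuple(pre > 0)

W1, b1, v1 = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)
W2, b2, v2 = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)

xs = np.linspace(-3, 3, 50_001)
shallow_patterns, deep_patterns = [], []
for x in xs:
    y1, p1 = layer(x, W1, b1, v1)       # shallow network
    _, p2 = layer(y1, W2, b2, v2)       # second layer composed on the first
    shallow_patterns.append(p1)
    deep_patterns.append(p1 + p2)       # joint pattern of both layers

def count_pieces(patterns):
    """Linear pieces = contiguous runs of identical activation patterns."""
    return 1 + sum(a != b for a, b in zip(patterns, patterns[1:]))

print("shallow pieces: ", count_pieces(shallow_patterns))   # at most 6 here
print("composed pieces:", count_pieces(deep_patterns))      # typically several times more
```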
If you lose information at the beginning, it can't be recovered. But I think you're trying to get at something else; some of it has to do with how flexible it is. You know, there's a continual-learning dimension to this, meaning that from the beginning of the neural network those initial basis functions are making cuts in space, and then the way I think about it is that all the complexity that happens later works within those first regions, so it's almost as if it becomes more and more ingrained in what it does as you go through the neural network; that is, it becomes increasingly difficult for it to learn completely different things.
I think I'm trying to interpret the way you phrased this, but maybe what you're trying to say is that it's like collapsing parts of space on top of themselves and then treating them similarly, and then you can't unpack them, and that might be how regularities in the data get exploited. That's one interpretation. On reflection, I think it actually happens the other way around; at first, maybe I didn't quite understand how you said it, but the first few layers are defining how the folding will happen, how these reflections will happen to the later layers, so really, in that view, it's the later layers that define the structure, and the first layers then propagate that structure to different places, reflecting it like a crazy house of mirrors across the rest of the space. I mean, I guess one
way I think about continual learning is to ask to what extent a neural network could be destroyed. Let's say you take GPT and just start fine-tuning it with noise; what would happen? Would it very quickly forget everything it learned before, or would it refuse to give in? That's a complicated question that would depend on the learning rate. And yes, I mean, these transformer models are quite interesting because, in general, as I understand them, and please correct me if you think I'm wrong, I don't think they train them to zero training error. I think they usually do one pass through their training data and stop, or three or four, so to some extent they probably remember more of what they saw recently than what they saw a billion tokens ago, or a trillion tokens, whatever it is. Yeah, how fast it would degrade if you added noise, I'm not sure. Really? Yes, so what have you done by adding that noise?
You have changed the entire loss surface, because now the loss surface only includes the noise, or, I'm not sure if you're suggesting the noise is added to the real data, but either way the loss surface is now different, and now you are going downhill from what used to be, I'm going to loosely say, the global minimum, and I'll probably say that several times during this conversation, but I mean the kind of level set of good solutions that fit the data well. So now you're going downhill from that point on your new loss surface, and how quickly it forgets just depends on the learning rate and the new shape of the loss function, so I don't know if I could say anything definitive about that.
I guess what I was going to get at is, you know, Chomsky says that we see the world through our innate priors, and in a sense neural networks are the same: when you look at a vision network, it has these Gabor filters that are built into the first few layers, and to some extent, after a while the neural network only sees the world in terms of its fundamental basis functions. So I don't like that idea of the representation; you know, you say that these Gabor filters are built into the first layers, and I would push back a little bit on that, yeah.
What people are saying when they say that is that this neuron has high activation when you put these kinds of things into it, and to some extent that's trivial; it has created a small set of basis functions, because this network at this point only sees a 7-by-7 patch of the image or something, and just because it reacts very strongly to that particular thing, you also know that if it's not in the first layer but a few layers in, it's going to react in a complex way to a lot of different things. So what you're trying to do is characterize this very complicated multidimensional polytope by a point that looks like a Gabor filter, when it's actually this incredibly complicated shape in a 150,000-dimensional space, and so I don't
like the idea that you can say: oh, well, we can characterize that shape by this point. But I think you are absolutely right; what we are seeing is interpretability methods that greatly simplify the network, and we're making statements based on that, so I think you're absolutely right about that. But there's also this interesting dimension of training versus inference: it's been said that over-parameterization is something that is useful for stochastic gradient descent, but I'd like to hear your views on to what extent we need to keep all that representational power for inference.
So do we need to memorize all those things, or could we actually prune them away? Yeah, so it definitely makes it easier. Terry Sejnowski has this beautiful expression where he says, you know, it goes from the search for a needle in a haystack to a haystack of needles, so you end up with this loss surface where there is a very high-dimensional part of that loss surface that is the global minimum, by which I mean a level set of good solutions that fit your data perfectly. Yes, I have a feeling, and I know you're going to push back on this, that there is an inherent bias in neural networks, i.e. a lot of these inductive priors and training tricks and so on are ways of making the network focus on the modes in the data, the areas of regularity or where most of the variation is found, and all the low-frequency attributes. And, to be as clear as possible about this, I mean that things that don't happen very often tend to be eliminated or ignored or not learned; is that something fundamental you would attribute to the neural network? So I think we do need to fit the data perfectly, or almost perfectly, so for our training data we always get the correct result, because ultimately these things are pretty dumb and there's not much they can do other than interpolate smoothly, and I'm using the word smoothly very loosely, because it's hard to characterize exactly how they interpolate. And I'd rather have something that smoothly interpolates between the correct, true data points than a neural network that missed them in the first place; you know, how could it be better not to memorize the data? So I have a feeling this is a story of the tyranny of metrics, that is, we optimize for accuracy, and the neural network,
I guess, would decide that since the tail is such a long, high-entropy thing, it's probably better not to bother learning it than to make mistakes all the time trying to learn it. But I think it does learn it, you know; for most data sets it learns the long tail of the data that you give it, because the loss goes to almost zero. For a regression model, the loss can literally become 0.0000; for classification models it's a little more complicated, because you're trying to push these softmax functions to infinity, so you never quite get to zero training loss, although the training error does get to zero. So I guess those things are like atoms in the universe: even though the network knows about them in the training set, they would never have any statistical power. Well, that wasn't a question, I'm just testing the principle, but I guess, you know, we
But I guess, you know, we'll talk about ethics, bias and fairness later; it's really difficult, with a single training objective, to square the circle and have both high accuracy and high fairness. Yes, absolutely. I think what you're worried about are areas of the data space that represent maybe minorities of individuals, and particularly intersections of those minorities, you know, gender and race or whatever, where you end up with a very small number of training data points, and it's a question whether you're attributing the lack of generalization to the neural network when that might be unfair; maybe the lack of generalization is because there is literally not enough statistical regularity in the data you've given it for it to possibly generalize. Interesting, interesting. Okay, well, just to close the loop on chapters three and four: there are some other things that we modify in our networks, like the initialization, the learning rate, the activation functions we use, and so on, and I think it's interesting to think about those through the lens of learning versus generalization. So, we've learned quite a bit about which things affect learning and which things affect generalization; surprisingly, or confusingly, the data set doesn't affect the learning as much. You can randomly permute the labels, yes, and the neural network will still learn the training data fine, or you can shuffle the input pixels of the images and the neural network still learns the training data. Okay, so there are some things that would really surprise you:
the network will still learn despite those factors, yes, and other things too, in terms of learning. Could we pause there, because that was a really interesting bit, and there's a picture in your book: there are two curves, and one curve is when you're memorizing essentially random information, and the other curve is off to the left, so the kind of horizontal distance between those curves is the generalization gap; the left curve is where you have the real data with the real labels, and then there's another curve in the middle where you have the real data but with random labels. So I guess the first intuition is: isn't it interesting that neural networks don't have to struggle that much harder to learn completely random data?
So this is the famous Zhang et al. 2017 paper, and it shows that you can permute the labels, or you can shuffle the input data in various ways, and the network still fits it perfectly; it just learns more slowly. But that could simply be because there are regularities in the real data, so the model surface naturally snaps to the various data points, whereas it has to contort itself more to fit completely random data. So I don't think the time it takes to fit is really that significant, but it shows how incredibly flexible the model is: you can throw any data at it and it will fit it and interpolate between it; it's just that the interpolation is meaningless if the labels carried no information. Yes, it's interesting that you say you can't tell how difficult a problem is from the convergence time.
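Here is a minimal sketch of the kind of label-randomization experiment being described, in the spirit of Zhang et al. 2017; the data, architecture and hyperparameters are arbitrary choices of mine, not the paper's, and with random labels you may need more optimization steps before everything is memorized.

```python
# Sketch of a label-randomization experiment (in the spirit of Zhang et al., 2017):
# the same small MLP is asked to fit structured labels and randomly shuffled labels.
# It fits both; the shuffled version just tends to take longer.
import torch
import torch.nn as nn

def steps_to_memorize(X, labels, max_steps=5000, lr=1e-2):
    model = nn.Sequential(nn.Linear(X.shape[1], 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 10))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(max_steps):
        opt.zero_grad()
        logits = model(X)
        loss = nn.functional.cross_entropy(logits, labels)
        loss.backward()
        opt.step()
        if (logits.argmax(1) == labels).all():    # training set fully memorized
            return step
    return max_steps                              # random labels may need more steps

torch.manual_seed(0)
X = torch.randn(1000, 40)                         # stand-in for 40-d inputs
y_true = X[:, :10].argmax(1)                      # labels correlated with the inputs
y_rand = y_true[torch.randperm(len(y_true))]      # same labels, randomly re-paired

print("structured labels:", steps_to_memorize(X, y_true))
print("shuffled labels:  ", steps_to_memorize(X, y_rand))
```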
Convergence time depends on many different factors, but one of the main ones is where you initialize the parameters. We initialize the parameters to certain variances for very practical reasons, because if you don't, you get exploding gradients, where, for people who don't know about this, the numbers in your neural network, the activations and outputs, start taking on tremendously large values and you can no longer represent them in finite-precision floating point; or, if you don't set them up correctly, they end up too small, and we call that vanishing gradients. It's described in terms of gradients, but it actually also happens on the forward pass through the network. So we have very practical reasons for initializing our networks where we do, but even then there's a range of values you can choose that still works without breaking, without hitting the limits of machine precision, and in fact the magnitude at which you set the weights dramatically affects training time, and it also affects generalization. So there's a kind of Goldilocks zone, neither too big nor too small, of parameter magnitudes that seem to give the loss function sensible properties: positive curvature where you want positive curvature. And I think it's pretty easy to understand that if all the parameters are too small, then you can only create something very flat.
I mean, generally speaking, and people who really know what they're talking about will be offended by this, but generally speaking, if the parameters are very small you get very flat functions, and maybe your function isn't very flat; and if the parameters are really big, then you get functions with big jumps, yes, and you don't want that either, because maybe your data is quite smooth. So if you don't initialize the parameters correctly, it takes longer to reach the presumably reasonably smooth pattern of your data, and of course that also speaks to generalization, because you want it to interpolate smoothly between your data points; you don't want parameter values that cause large fluctuations between your data points while still fitting them. So the time training takes also depends on the initialization, but if you give it enough time it seems to get there.
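Here is a small numerical sketch of that Goldilocks zone (my own illustration, with arbitrary depth and width): with weights drawn at the He-initialization scale, the activation magnitudes stay roughly constant through a deep ReLU network, while halving or doubling that scale makes them collapse or explode.

```python
# Effect of weight scale on the forward pass of a deep ReLU network.
# He initialization uses std = sqrt(2 / fan_in); the other scales are deliberately off.
import numpy as np

def final_activation_norm(weight_std, depth=50, width=1000, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        h = np.maximum(0.0, W @ h)        # ReLU layer, biases omitted for simplicity
    return np.linalg.norm(h)

he = np.sqrt(2.0 / 1000)
for name, std in [("too small (0.5x He)", 0.5 * he),
                  ("He init            ", he),
                  ("too large (2x He)  ", 2.0 * he)]:
    print(f"{name} -> final activation norm ~ {final_activation_norm(std):.3e}")
```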
In terms of generalization, there's a phenomenon connected to that called grokking. Yeah, I had Neel Nanda on, right? Basically, you can find cases where the model fits the data perfectly from early on, but then it takes ages and ages and suddenly it generalizes. This has been interpreted, or my best understanding of the interpretation is, that this is what happens when you set the parameter magnitudes wrongly: the model fits the data correctly but varies wildly between the data points, and then something in our training methods, presumably some feature of stochastic gradient descent or some other explicit or implicit regularization you've put in there, causes the solution, even though it already fits the data, to keep moving
across the loss surface; it's not at a single minimum, it's in a family of minima, and it moves through that family until it finally gets to something that's smoother and interpolates well, and then it generalizes. But exactly why it should move across this surface towards something smoother is, again, quite complicated and subtle. Yeah, people should definitely check out the Neel Nanda episode, but it's like you say, this phase shift from memorization to sudden generalization; I think Neel said it's a bit of an illusion and people misunderstand how it works. And, like you said, we're trying to design the loss surface to be as smooth as possible, and I think the original grokking paper was about modular arithmetic, which is not very natural, because I suppose part of why it's possible for the loss surface to be so smooth and easy to interpolate is also down to the data being natural.
So maybe an artificial data set like that wouldn't be as well suited, and you'd get this sudden shift later. Yeah, maybe; you mean the easiest way to fit it is with some function that doesn't have a natural smoothness, so you end up fitting it with a very complicated function with a lot of variation between the data points, and then it takes a lot of regularization and wandering over the loss surface until you get somewhere sensible, whereas a more typical data set naturally starts off with a smoother surface. I mean, that makes sense to me, yes, but it's one of the crazy things about deep learning, and grokking was a surprise, because most of the time it's like cooking: the secret is in the preparation. In a way you massage the loss surface and then you can almost predict what will happen, just as OpenAI could predict how much training was required to reach the GPT-4 perplexity level; there seems to be a regularity, in the sense that it's going to converge, and because it's overparameterized and set up in a certain way, it's going to converge in a certain amount of time and have a certain shape when you train it.
Yeah. The most recent paper I read about this, and you have to understand that I'm a bit like ChatGPT: every time I finish writing a chapter of the book, I stop reading, because if you don't stop reading you'll go crazy; there are 4,000 papers published every month and you can't possibly keep up to date. The most recent paper I read about this was called Omnigrok, and it basically says that the timing is predictable because the magnitude of the weights is gradually decreasing at a pretty steady rate, and the weights finally reach this Goldilocks zone where things generalize well, and that's when the generalization performance suddenly improves. Great. Now, the book makes a really huge effort, I would say, to visualize things, you know, with images and in low dimensions; what was the thinking behind that? Well, I think people learn in different ways, so I try to have three ways of understanding everything: there are textual descriptions, there are equations, and there are images, and for me the mental process of connecting those things, which you have to do actively if you read the book, leads you to a deeper kind of understanding.
Many ideas, including diffusion models and GANs, can be drawn in one or two dimensions, and you can convey the concepts really effectively in a visual way before you get to the equations, which is good for learning. But you also have to be a little careful, because multidimensional space doesn't work the way you expect; there are a lot of strange phenomena. If you take two random points from a Gaussian distribution, then by about 100 dimensions, or even fewer, they are almost certainly close to orthogonal to each other. One of the most famous examples is the hyper-orange: in a multidimensional orange, basically all of the volume is in the skin and none of it is in the pulp. Or my favourite: if you take a multidimensional sphere of diameter one, a hypersphere, and embed it in a multidimensional cube (you can imagine it in two dimensions as a circle in a square, and in three dimensions as a sphere in a cube), then as the number of dimensions increases, the proportion of the space the sphere occupies goes to zero: it's all in the corners. So despite the dimensionality, I make these drawings in 1D or 2D; I like to joke that it's because MIT Press didn't give me the budget for the five-dimensional book I asked for, so I was stuck with two dimensions.
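Those two facts are easy to check numerically; here is a quick Monte Carlo sketch of my own (with arbitrary sample sizes) of the near-orthogonality of random Gaussian vectors and of the vanishing volume fraction of the ball inscribed in a cube.

```python
# Two high-dimensional curiosities mentioned above, checked numerically:
# (1) random Gaussian vectors become nearly orthogonal as the dimension grows;
# (2) the ball inscribed in a cube occupies a vanishing fraction of the cube's volume.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    a, b = rng.standard_normal(d), rng.standard_normal(d)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Monte Carlo: fraction of uniform points in the cube [-0.5, 0.5]^d
    # that also fall inside the inscribed ball of radius 0.5.
    points = rng.uniform(-0.5, 0.5, size=(100_000, d))
    fraction_in_ball = np.mean(np.linalg.norm(points, axis=1) <= 0.5)

    print(f"d={d:5d}  cos(angle) = {cosine:+.3f}  ball/cube volume fraction = {fraction_in_ball:.5f}")
```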
But having said that, I don't think you need to work in the super-high-dimensional spaces we actually work in to see deep learning phenomena, and there's a really interesting data set I use a lot in the book called MNIST-1D. Our simplest standard data set is MNIST, which has some disadvantages, for example that we're now too good at solving it for it to prove much, and this is an even simpler data set. Basically it consists of 40-dimensional data: it's just one-dimensional data that starts with a template that looks a little bit like one of the 10 digits, so for example zero has one bulge and eight has two bulges, which then gets warped and translated. It's actually by a guy called Sam Greydanus, and the paper is called Scaling Down Deep Learning, completely against the prevailing winds of deep learning, which were to scale everything up massively. It's a type of data that's friendly to convolutional networks, for example, because there's a dimension they can be equivariant or invariant to; it adds various amounts of noise, and it's completely procedurally generated, so you can generate any amount of data with any amount of noise. And the interesting thing is that, although it's only 40 dimensions, you can see most of the interesting phenomena of deep learning: you can show adversarial examples, you can show lottery tickets, you can show double descent, and all of that.
It's at a size where you don't even need a GPU; you can just run it on the CPU in your local Python session. And I think it's a really interesting data set. Deep learning was built up from huge experiments, and now we're building things that are on the order of the complexity of a human mind. I come from a background in the biological sciences, basically, and if you want to understand something like that, what you do is go out and collect a data set as exhaustively as possible, in as many dimensions as you can, and then try to come up with a theory that explains that data set.
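For anyone who wants to try it, here is a rough sketch of that workflow. It assumes the `mnist1d` package from Sam Greydanus's repository, whose `get_dataset_args`/`make_dataset` helpers I'm taking from its README (names may differ between versions), and the little MLP and its hyperparameters are arbitrary choices of mine.

```python
# Sketch: generate MNIST-1D procedurally and fit a small MLP on the CPU.
# Assumes `pip install mnist1d`; helper names follow the project README.
import torch
import torch.nn as nn
from mnist1d.data import make_dataset, get_dataset_args

args = get_dataset_args()                 # default: 40-dimensional signals, 10 classes
data = make_dataset(args)                 # procedurally generated, so any amount of data is possible
x = torch.tensor(data['x'], dtype=torch.float32)
y = torch.tensor(data['y'], dtype=torch.long)
x_test = torch.tensor(data['x_test'], dtype=torch.float32)
y_test = torch.tensor(data['y_test'], dtype=torch.long)

model = nn.Sequential(nn.Linear(x.shape[1], 100), nn.ReLU(), nn.Linear(100, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):                  # small enough to run in seconds on a laptop CPU
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

test_acc = (model(x_test).argmax(1) == y_test).float().mean().item()
print(f"test accuracy: {test_acc:.3f}")   # an MLP does okay here; convnets tend to do better
```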
In this case: when is it trainable, when does it generalize? And here we have a data set where, in theory, you could train many different networks while varying all the parameters, because it's very fast to train, and then you could try to build a theory that says: okay, when you have this type of data, with this amount of noise and these statistics, these are the networks that will train and these are the networks that will generalize, and have proper hard results about those things. So to me this is a really interesting data set that could be a testbed for understanding deep learning better, but it's currently barely used at all, because you can't publish a paper unless you have state of the art on a huge data set with millions of examples. Yeah, and we were going to talk earlier about the alchemy of deep learning, because we don't really have any general theory of deep learning; I mean, there are things like the NTK, and we talked about spline
theory and stuff like that, but there really isn't much; there are bits and bobs, and many of them are based on unrealistic assumptions. Some depend on the network being infinitely wide, which actually worries me less than other assumptions, because infinitely wide usually means that at some point you're relying on the central limit theorem or the law of large numbers, either of which converges to something like a Gaussian, or to the expected value, long before you're anywhere near infinite, so I'm fairly okay with those. But assumptions that depend on, you know, the width of the network being the square of the number of data points or something, those start to worry me, because it sounds like an argument based on some kind of combinatorial or geometric reasoning, and it might actually need that exact number to work.
There's a history of experimentalists rushing ahead, often introducing components into our networks that definitely help empirically, explaining them in some way, and then when we go back and look at them it turns out they don't do exactly what we said they would do, and the theory people are way behind. Some interesting parts of the theory, I think, are neural network Gaussian processes; the NTK you mentioned I think of as the basic version of that, anyway. They can predict some things about trainability, but they largely predict things about trainability that we already knew through experimentation. And, you
know, using some of that they've built networks with thousands of layers, but those networks with thousands of layers don't get state-of-the-art results, so one has to wonder whether experimentalists already figured that out but couldn't get a paper out of it, so they just threw it in the bin. Not that that work isn't really interesting and really important, just that it lags behind, and we're in this weird situation where a bunch of the gadgets we've installed to make networks work don't do exactly what we think they do. When you first learn about stochastic gradient descent, you're taught that the reason we include it is so you can bounce out of local minima, maybe into another valley, but that turns out to be pretty much nonsense; I don't think that's what's happening. You can train pretty large neural networks with huge batch sizes, so there's not much randomization, or sometimes even with a full batch, and still get to the bottom. So why do we still use stochastic gradient descent?
I mean, number one, it's fast: you're only using a subset of the data points, and if you have millions of data points that saves you a lot of time. Number two, it seems to have some regularization effect, and you can actually characterize that; there's a section in the book on implicit regularization. You can show, for finite steps, that it's basically the difference between taking infinitely small steps, which you would call gradient flow, and finite steps; once you account for that, combined with the stochastic element, you can show there's a regularization expression that you can write in closed form and that you're effectively adding to the loss function, so the learning takes a slightly different path and gets to the bottom in a different place.
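For reference, I believe the closed-form expression being alluded to is the implicit gradient regularization result of Barrett and Dherin (my reading of the discussion, not a formula quoted from the book): to leading order, taking finite gradient-descent steps of size $\eta$, rather than following the continuous gradient flow, is equivalent to flowing on a modified loss

$$\tilde{\mathcal{L}}(\theta) \;=\; \mathcal{L}(\theta) + \frac{\eta}{4}\,\bigl\lVert \nabla_{\theta}\mathcal{L}(\theta) \bigr\rVert^{2},$$

so larger steps implicitly penalize sharp regions of the loss where the gradient norm is large; the stochastic version adds an analogous penalty involving the individual minibatch gradients.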
And another example of experimentalists rushing ahead would be BatchNorm, which has a really peculiar history. BatchNorm, is that Goodfellow? No, is it Szegedy? I had him on the show, yeah. BatchNorm was originally introduced to address something called internal covariate shift, which I'm not sure I fully understand, but I think the basic idea, and correct me if I'm wrong here, is that after you've adjusted the parameters in the later layers, the changes you make to the parameters in the earlier layers no longer make sense, which from a purely mathematical point of view isn't very comprehensible, but I guess it has to do with the fact that we're not taking infinitely small steps, we're taking finite ones. Later experiments showed that covariate shift isn't really what matters, and BatchNorm stuck around because it turned out to stop exploding and vanishing gradients: residual networks were developed, and they bring those problems back, and BatchNorm resets the variance and consequently stops the gradients exploding. So it found a use, but then it turns out there are other ways to solve that problem; you can solve it just by rescaling, without using the batch statistics at all. It also turns out to have a regularization effect, because it basically adds random noise, since the batch is slightly different each time, and it has more complicated effects too, where it lets a bit of information leak from one data point to another, because one example can have a really huge value, which makes the batch variance large, which then shrinks everything else. That's why they don't use it in transformers with masked attention: the whole point there is that words earlier in the sentence shouldn't have access to words later in the sentence, and BatchNorm mixes information across the batch, so they use LayerNorm instead.
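Here is a small sketch of the statistics being described (my own illustration, with the learnable scale and shift omitted): BatchNorm shares its mean and variance across the batch, so one extreme example changes how every other example is normalized, while LayerNorm normalizes each example on its own.

```python
# BatchNorm vs LayerNorm statistics (affine scale/shift parameters omitted).
import torch

def batch_norm(x, eps=1e-5):     # x: (batch, features); training-mode batch statistics
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

def layer_norm(x, eps=1e-5):     # per-example statistics
    mean = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(0)
x = torch.randn(8, 4)
x_outlier = x.clone()
x_outlier[0] += 100.0            # one huge example in the batch

# Rows 1..7 are identical in both inputs, but BatchNorm's output for them changes
# because the shared batch variance blew up; LayerNorm's output does not.
print((batch_norm(x)[1:] - batch_norm(x_outlier)[1:]).abs().max())   # clearly nonzero
print((layer_norm(x)[1:] - layer_norm(x_outlier)[1:]).abs().max())   # exactly zero
```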
BatchNorm is an example of something that was introduced for one reason, adopted for another, and is now kept around in some form because it has this indirect regularization effect, and it also has some drawbacks that have made it necessary to adapt it in certain circumstances. Yes, but I was going to make the comment that, if anything, as time has gone on we seem to rely less on some of these tricks and focus more on large models, more compute, more data and so on. Do you think there's any truth to that? Because, as you just said yourself, in many cases people don't understand exactly what the effect of these things is, and when they do ablation studies they tend to learn that actually this wasn't necessary after all; so is there a trend towards simplification?
I'm not sure there is a trend towards simplification. I think there should be. There's a great paper from around 2020 where they look at all the modifications that have been made to transformers across these different papers, test them scientifically, and find that almost none of them make any difference. And that's tied to our obsession with state of the art: you try all these different things, and then you don't interrogate too closely whether this particular activation function really was critical, because it's two days until the NeurIPS deadline and you just got state of the art, and then nobody goes back and looks at these things. I do know that when transformers were introduced they were very complicated to train: you had to use this thing called learning-rate warm-up; there are quite complicated effects of where you put the normalization layer and what happens to the variance in the residual layers; and the parameter gradients differ in size on the path through the softmax versus the path that computes the values versus the residual, since you have three parallel branches. So they were quite difficult to train, and a lot rested on training tricks. Whether the current generation of transformer training, which I guess is what you're implying when you talk about very large models,
has removed all of that, I just don't know; I don't know the answer to your question. Yeah, I mean, it's something to do with this graduate-student-descent thing we've been talking about. I was talking to some people at NeurIPS, and, I mean, Hinton, for example, was always a big fan of contrastive learning, where you contrast one image with all the others in the data set, and then I think LeCun at FAIR pioneered this non-contrastive approach, which I think is similar to Siamese learning, except you augment pairs of images and pull those pairs together rather than contrasting them against everything else, and there are so many variations of that: Barlow Twins and God knows what else; it's been a while since I've looked at it. But these things are just millions and millions of minor engineering variations on an idea. Yes, and it's interesting that this is now what we consider results at a scientific conference; we don't necessarily value insight.
There are papers that look for insight, but it's a smaller community and it's getting left behind. And, personally, part of the reason I stopped doing experimental research was that I really love ideas, I love trying to understand what's going on, but it often comes down to graduate student descent: it takes the supervisor an afternoon to come up with an interesting idea, then 20 minutes to persuade the graduate student that it was really their idea (PhD supervisors, you know what I'm talking about), and then they go away for six months of experiments to get the results they
need to get into the conference, and I don't think that's a good use of scientific time or public money. It's probably a good use of Google's money and OpenAI's money, but is it a good use of Stanford's money? Maybe not; maybe they should do something that's a bit more scientific and has some genuine ideas, rather than trying to push a benchmark up by another 0.1%, because often that's just a fluke. Also, just a couple of weeks ago there was a paper arguing that convolutional networks are actually as good as Vision Transformers, or certainly comparable to the best Vision Transformers, on the ImageNet classification task; it's just that nobody had pre-trained them on Google's huge training database before or deployed them at that kind of scale. So we can't even necessarily trust all of these results.
I mean, we can trust them in the sense that, in 99% of cases, they did what they say they did and got the results they got, but we can't trust the scientific conclusion that Vision Transformers are much better than convolutional networks until someone has properly done the science, an exactly comparable experiment, and as a community we're not very scientific. We have got better than in previous decades: it used to be crazy; people would claim "I invented this new method", but really they'd changed about seven things, and the new method was often not what made the performance
improve. We've improved a little because now we do ablation studies, but basically we're still not very scientific; these are people trained as engineers, not as scientists. And even when they do ablation studies it's very expensive, and it's now become a game that only the biggest players can play. OpenAI, for example: they've done some very interesting research over the years, on scaling laws and RLHF and how to tune GPT and so on, but ultimately, and I don't want to sound cynical, the success of their methods comes down to the sheer amount of data. They've become the next Google, the next organization hoovering up all this data, and I think a lot of people underestimate how valuable the data from people using ChatGPT is. So they are hoovering up data, absolutely; there's a reason they let people use it for free. Oh, 100%; it's one step away from a Mechanical Turk: in the background they're just generating more and more data, and they've cultivated this image that they're doing science, when, for the reasons we just discussed, I think they're doing engineering. And, yeah, I'm really after something different. No disrespect to engineers; it's certainly incredibly hard to engineer GPT at that kind of scale, but it's not very intellectually satisfying, and sometimes those papers don't contain many ideas you can take away, other than, I guess, that it's really interesting that if you keep adding data it keeps scaling, which isn't something we would necessarily have predicted. So the amount of data is perhaps the limit for this type of model: not the compute or the size of the model, but literally the amount of data that can be scraped from the internet.
We'll see what happens, but yes, there often aren't many ideas, and I'd like to reiterate that you can get most of the interesting phenomena of deep learning in 40 dimensions. So if you want to do science, or you want to try a new idea, don't even try it on MNIST; try it on MNIST-1D, and if it works there, then try to scale it up. And if it turns out to be an interesting idea on your small 40-dimensional data set, where nobody cares what the state of the art is, then it's worth writing a scientific paper about it.
I think one thing that happened with GPT is that it crossed the uncanny valley, and we were talking earlier about how people psychologize and reach for all these adjacent mental frameworks to understand what's happening. The reason people are so excited about the singularity and AGI and all of this is that we now have this artifact, which is remarkable, and it has more or less memorized all the data on the internet, and I think cognition involves more than just being able to retrieve information. I think it's something that is very, very good when used interactively by humans: I'm using it the way I would use Google, I'm looking for information, and it's doing things that are remarkable in some ways, but you're still the intelligence; the interesting part is in what you ask it,
not in how it responds. 100%. And that's why, when you set it up as an autonomous agent, something like AutoGPT, or even have it do something on a schedule, the magic goes away; the uncanny valley, or the suspension of disbelief or whatever, just disappears instantly, and it does stupid things, and you go, oh, I can't believe I've been fooled by randomness. There's the UX, and the fact that it's an extended mind; it's what David Chalmers and Andy Clark called the extended mind thesis, essentially, so it's a cognitive element that becomes part of my physical process of cognition, but a lot of people get fooled in that weird way. Yeah, I think
you have a higher opinion of it than I do. My experiments with it have not been very successful: I have repeatedly asked it to generate a story given some premises, and it almost always generates the same story, even with the same name for the protagonist. It knows some things very well, because there's a giant corpus of data, and, unfortunately, it knows the things computer scientists know very well, because those are present in abundance on the web. So if you ask it about search algorithms, it really does know about search algorithms and can write code for them, because there's repository after repository implementing different sorting and searching algorithms; but once you get to the margins of its training data, it freaks out, or, you know, they're trying to stop it from doing something, and it completely falls apart. But I agree it's crossed an uncanny valley, and it's obviously the first thing that captured the public's imagination.
It's the first time, you know, that my non-AI friends, and I do have some non-AI friends, ask me about it: do you know anything about ChatGPT? And interestingly, they often have very strong opinions about how it works, which are completely wrong, and they often don't ask the domain expert what's going on, which I also find curious; people have their own theories. Well, I guess, you know, people like SG and Hinton are saying it could lead to superintelligence, and based on the conversation we've just had, we know it's interpolating a manifold of data, and we know that from a computational standpoint it's a fixed-size computation, and so on, so how could it be a superintelligence?
I don't think it can. I mean, it would have to start somehow manipulating some of its internal representations to make them more coherent, to generate new ideas and then test them and incorporate them into its body of knowledge; it really has no way of learning anything new other than its context vector, which it forgets every session. Yeah, that's right. Even if it had a way to manipulate the information it had put into that context vector and formulate it into something new, you know, maybe make some logical deductions from it or come up with some new hypothesis based on it, there's no way for it to even remember that, so we're missing all kinds of pieces of the puzzle.
I don't see how you can look at that and think, oh, superintelligence is just around the corner. Yeah. I mean, a question for you: I don't like to talk about superintelligence, because I don't think it's a useful thing to talk about, but we could talk about AGI, so how far away is AGI? Well, I guess part of the reason I find the question strange is that I think these people believe there's such a thing as pure intelligence, and that if you've just learned a manifold of data then that should suffice, because it's the data we've produced in our physical environment here on the planet; so, you know, it's all been reduced to a representation of data and we're interpolating it, like you said.
It's non-interactive, so it can't search for new information, and it's non-reflective, so it can't reflect on its own context or anything like that; it's simply an information retrieval system that has become part of our cognitive process. Yes. I don't even like the word intelligence; I think capability is a better word to use. It has certain capabilities, and some of those capabilities are better than humans': it can tell me about the recent history of epigenetics right now, that comes to mind, and I can't do that. I think it would be much better to stop using the word intelligence and start talking about capabilities, because that's something very concrete: you can say it can do this task or it can't do this task. And what
AGI means is that, for a broad set of capabilities, it can do a large enough proportion of them that we think of it as general intelligence, and at least that's measurable. I'm not a big fan of that, and I'll explain why. I mean, intelligence: I love this Chinese proverb about Mount Lu that Pei Wang talks about in intelligence research, basically saying that intelligence is a complex phenomenon beyond our cognitive horizon, and, just like the Mount Lu mountain range, depending on your perspective the phenomenon looks completely different. So you get these different perspectives, like capability,
behaviour, structure, principle. Well, I guess I told you my perspective. But the reason I don't like it is that capability is, you know, a behavioural interpretation of intelligence, and the reason I don't like that is that GPT doesn't do anything by itself: all cognitive processes are in some sense externalized and physical, and we enact a cognitive process when we use it, so the artifact itself, in some sense, has no measurable capability. I guess I disagree with the premise that GPT doesn't do anything, but that doesn't mean it's just a bunch of, I mean, um,
I suppose this is a rather esoteric externalized-cognition argument, because one could argue the same about a chess computer: like the meaning of a calculation, its meaning is in its use, so intrinsically it does nothing; we use it as part of our physical cognitive processes. Give me an example of something that does something, then. This is a good one; I think you're using this word in a very specific way that I don't quite understand. We do things: we drive, you know, we actually do things, and again, maybe I'm just being a human chauvinist, and I'm all for it being integrated into our cognitive ecosystem, but, hey, AutoGPT doesn't do things very well, but it does do things.
Well, I don't know if it does; I mean, yeah, in the way that a cat flap does things: you can make it do things, it can execute code and so on, but I wouldn't say it has agency. No, it doesn't. I mean, agency is about being able to wrestle with the complexity of your environment as an observer with limited computation, as wri would say, so it's about that kind of complexity differential between the situation I'm in and the information I have access to. And I feel like you're going to reject this, but how do you see reinforcement learning?
Isn't that exactly what you're talking about? There is literally something called an agent that has agency, maybe only through some kind of naive algorithm that does something random until it gets some reward, but it explores its complex environment; sometimes it solves problems because it gets direct rewards, sometimes because of intrinsic rewards we've put in, essentially curiosity. Don't you see that as doing something? I don't know; I don't see it as so dramatically different from the cat-flap scenario, I guess. I mean, I love this idea from Friston that agency is really a description of a certain dynamic: you get this dense knot of information flow where a thing is effectively planning many steps ahead, so you could say it therefore has agency. In the case of reinforcement learning, the thing I don't like is that we can talk about free will, but someone has designed the reward function.
The final chapter of your book is about ethics and algorithmic fairness, and right now there's an interesting mix between short-term safety and long-term safety; in fact, I'm concerned that the word safety has become conflated in its meaning and people are using one word for both long-term and short-term risks. In your book I think you focused mainly on short-term risks, you know, bias, misinformation, fairness and things like that, but yeah, tell us about the chapter properly. So, let me address the chapter first. The chapter was co-written, by which I mean I did
it together with a researcher named Travis LaCroix, who is at Dalhousie University in eastern Canada. As an engineer, I didn't personally feel confident that I could write an intellectually coherent chapter on ethics, but he really encouraged me to do it. I mean, it's interesting that an engineering textbook now needs a chapter on ethics, and that's because of the force multiplication of this technology: it's very powerful. You can read this book and go and build things that are not good for society, and that's why it's imperative that you at least think about ethics.
I'm not trying to tell you what to do. So I wrote this chapter with Travis, who, by the way, is publishing his own book next year on value alignment, which people should check out, and he has a background in philosophy, so we went through a lot of iterations of him using long philosophical words and me trying to boil it down until engineers can understand it easily. We cover some aspects that are commonly discussed in our world, you know, bias, explainability, things like that, and some more philosophical things about moral agency and value alignment that are more likely to be discussed in your field. I think I only let him use the word epistemology once, as a compromise. But what's interesting about the chapter is really the conclusion, which is a call to scientists to be conscious of, and take responsibility for, the actions they take. You might think you're just doing maths and it carries no values, but everything you do is, as they would say, value-laden, and that's often so deeply embedded in what we do that you can't even see it. The very fact that we judge our papers solely on whether they achieve state of the art, and we don't report whether they're fair, whether they're explainable, or how much carbon has to be emitted to train or run them.
That's a value embedded in our community that we don't necessarily think about. So the book ends with a call to interrogate yourself, in a way, and think about the impacts of your work and which communities it will affect. Individual scientists don't necessarily have a huge amount of control, but you do have some: you can choose who you work for, and you can choose which problems you work on. In a sense it's a collective action problem, in that individual people don't have much control, but collectively you can do things if you organize and decide that this is the kind of thing we should or shouldn't pursue. That's how the book concludes. And I'd like to separate that discussion from my personal views on what the risks of AI are, which don't represent Travis's views, nor the views of the University of Bath; let's have that on camera, yes. The book focuses almost entirely on nearer-term problems, largely bias, explainability and so on, and only lightly touches on the far-off problems, mainly because the near-term ones are much more concrete and easier to address than distant problems, which are harder, or perhaps easier, to talk about depending on how you look at it. Okay, so where do I stand?
I mean, I think there are actually concerns at all levels, from people who are worried about AGI and beyond AGI (I'm not even going to use the word), down through intermediate, societal levels, to the problems that exist right now, and I inherently disagree with anyone who says we have to focus totally on one of these and that it takes oxygen away from the discussion of the others. We are an extremely rich community: if for every hundred million dollars of venture capital we put one person to work on one of these problems, that would amount to a lot of people, and we could study this at all levels. At the very far end, AGI and the word I'm not going to mention, it's very difficult for me; I think those people undermine their own arguments by being so inconsistent. On one page they say we shouldn't anthropomorphize systems that are smarter than us, and then on the next page they say, well, obviously it will want to do this, and it will care about that, and it will try to stop us doing this, and very often the arguments are a bit incoherent. And then big assumptions get made: people say, well, obviously it will have free will, or obviously it will have a sense of self, and honestly I don't want to get involved at that end of things.
I don't think I have anything more intelligent to say about it, but that doesn't mean we shouldn't worry about it. I think it's really great that people care about the short-term issues, bias and explainability; I'm pessimistic about explainability, more optimistic about bias. I think what isn't talked about enough is how we get this technology into society when it can replace a huge number of jobs, and it's not the way Geoff Hinton put it, that there won't be any more radiologists; what will happen is that one radiologist will be able to do the work that 20 radiologists used to do.
That's already happening now. Gmail automatically completes your emails, so if your job is mostly sending emails, you can do it 1% faster and we need 1% fewer people to do it. And we already have the latest version of ChatGPT, and DALL-E 3 can do a lot of work, which will reduce the number of people we need to get things done in companies, and I think it will make a lot of knowledge workers, human resources managers and so on, unemployed in a short time. McKinsey, I think in 2018, estimated that there could be 800 million people unemployed by 2030.
I don't know if they're sticking to that; if anything, I would say we're moving faster than we could have anticipated, and this is really worrying for society, because one important factor that causes civil unrest and instability is people who used to have high status losing that status, and you're going to create that situation en masse very quickly. People have made a deal with society where they put off making money for a long time so they can train to become lawyers or doctors, and now suddenly their status has disappeared.
You know, we just don't need them; there are already plenty of lawyers and doctors. I think we will eventually create new jobs to replace those people, but the technology is coming at us very fast, like a wall, and I don't think society can adapt, so actually, for completely different reasons than the AGI people, I think we should try to slow down the deployment of some of this technology into society. Yes. It's interesting that you hold that opinion; I'm a little more skeptical about the job replacement and deskilling part, mainly because I think people are still overestimating how good this technology is. For example, we've had Copilot, which helps software engineers generate code, and I don't think you'll really find a software engineer who will tell you it's going to replace them; in fact, most engineers don't even think it makes their job much easier, and that's because there are so many layers of complexity to being a software engineer, and code autocompletion doesn't really move the needle. The syntax isn't the hard part of programming; the hard part is organizing the code base in a sensible way. Oh, absolutely, and I don't want to trivialize it; there will be huge job displacements and market changes and so on. But, I don't know,
I still think the traditional risks around fairness and algorithmic bias are number one right now. I'm just reading this book called Broken Code, which is about Facebook and how they did all their internal engineering, and there was a real revolution at Facebook after 2011 when they changed the news feed from being chronological to being a kind of recommendation system. Everyone hated it at first, but they pushed it through, and essentially any kind of human curation on many different parts of Facebook, the news feed for example, was removed, and they almost wanted to remove it intentionally, because they didn't like people saying they had a conservative bias or whatever, and it meant they could absolve themselves of any responsibility: the algorithm is doing it, and it's just giving people what they want. And it created a sewer; there were all kinds of horrible things brewing on their platform that the leadership had no idea about and in many cases denied for many years, and I think it's a real risk that we're starting to trust algorithms to do things for us.
I think it's a risk, but I'll point out that you just did exactly what I warned about, which is that I think there are problems at all levels. There is, you know, a big problem with bias and with the algorithms that filter the information we get, but that doesn't mean the problem I mentioned five minutes ago isn't important and isn't worth talking about. But that's how the dialogue always goes: you say, oh, bias is really important, and I say, oh no, no, unemployment is important, and no, we need to talk about both.
You know, your reaction is exactly everyone's reaction, which is to minimize the employment issue; everyone just talks about what matters most to them to the detriment of other things. I don't want to say for a moment that the things you talk about aren't important, but I want to broaden the discussion. The reason I bring up unemployment is that I think it isn't talked about enough and I think it's almost inevitable; you would have to give me really powerful arguments to convince me that people won't be able to do their jobs more efficiently in the next few years and that this won't put a lot of people out of work. You just don't know: maybe you weren't fired, maybe you just weren't hired in the first place, young people, you just don't know. If you're a designer at a company, let's imagine you make greeting cards and you used to hand-draw them, but now you draw one in the style you like and then just ask for the next 10, saying, you know, fill them in in this style. You've just gotten rid of
a lot of greeting card designers. I don't think there's a good argument that people don't need greeting cards anymore, and it will take a while to find more human needs for those people to fill. Yeah, I guess. And I think there are things that governments, you know, could be thinking about to approach this; one thing that could slow it down would be to make companies actually liable for the effects of their products. So, you know, if, once we're off mic, I tell you the recipe for a nerve gas and you use it,
I will be an accessory before the fact, and when they arrest you they'll arrest me and we'll both go to jail. But if ChatGPT does that, I'm pretty sure OpenAI have a clause in their terms and conditions, which you didn't read, that says they are in no way responsible. If we started holding them responsible for that, it would be a way to slow things down and make things safer, and I have no doubt that people could still make a lot of money in that environment. And you could go further; you know, it's starting to happen with this executive order from Joe Biden, which is beginning to take a closer look at these models.
You have to think a little bit about how the taxes are going to work, too. So, could I touch on something you said? I think we should distinguish between the diffusion of capabilities versus agency. I think right now these things are glorified calculators. Me too. But calculators do calculations; they got rid of the people who used to do the calculations, didn't they? And that's very true, but right now all the existing frameworks we have are around "I automate something, I create a software service to do something". The real concern, if it ever happened, is the automation of our agency. From an algorithmic fairness standpoint, if GPT were actually creating, say, a skills program and being deployed on a server without any human oversight, doing things with no one responsible, that would be a big problem, but I don't think that's a problem yet. Although there may come a time when we gradually cede control, because little by little the computer does it a little better.
There's also the real risk of deskilling: if, for example, we have self-driving cars that drive themselves most of the time, you get less driving experience, and then the car hands you control at exactly the most dangerous moment, and now you've only driven a few hours in the last year instead of hundreds of hours and you have less experience. Connected to that, you have gradually given up your agency, and that is a real thing: there was an Air France accident over the South Atlantic that was attributed to exactly that. The autopilot handed over control, as I understand the situation, and it turned out that the pilots didn't actually have much hands-on flying experience over the previous year, because most of the time the autopilot works, and, more importantly, they didn't have experience of near-dangerous situations, because the autopilot was good enough to handle those, and it dropped them into an extremely dangerous, ambiguous situation where the sensors weren't working. So there's a chance that we cede our abilities gradually and are only called on to use them in emergency situations, which is not good.
Sorry, I don't think that's quite what you were talking about. No, but that doesn't mean it isn't really important, so I agree it's a big deal. Although you could argue about what it means to make some kind of skill easily automatable, because we're getting into purpose and moral value here; you might argue that society right now recognizes that doing routine work has value, and in the future we may not see it that way. But I agree the problem is the slippery slope: people are still doing the same work, but you're eroding their free will, they're becoming automatons, and you're setting them up to fail.
Yeah, I think that might be happening in certain types of warehouse jobs, where you're essentially using people because you can't yet build robots to do it; robots are pretty good now, but, yeah. Keep going, though: what's the solution to that? Because it almost seems like a glib position to say, well, let's keep them busy in the warehouses instead of freeing them up to do something different. I mean, you're asking me for solutions to problems that nobody has solutions for. I think there's a sense of purpose, and, this is really my personal opinion, not science, human beings are made to work, and
they're not happy if they don't achieve things. It's a very common phenomenon: people retire and really love it for the first year, and then in the second year they feel purposeless and alone, especially if their job was something that corresponded to their sense of themselves, as a doctor or whatever, so they feel purposeless and unhappy. And we're getting to the fact that there are certain people, probably like you, who are self-motivated and would be absolutely fine the moment they have no responsibilities. I'd love it, it's my number one dream. And so would I, but unfortunately the whole world isn't built that way.
You know, we're not at all a representative sample of the population. So that's fine, but I guess I'm saying there's an element of paternalism to this argument. I mean, I agree with you, and right now it might be because of the way work has been organized for centuries that our society has built purpose and recognition around it, and it takes quite a while for society to change; but are we saying we almost want to keep the status quo? That is, to somehow stop the progression of what comes next?
I mean, I think you're suggesting a grand experiment in which we enter a phase of chaos and society reorganizes itself around different principles. Well, no, but that's a really interesting point, because it gets to the core of what we're also discussing with long-term risk, which is, well, risk is actually the operative word here: do we allow scary things to happen, or should we prevent them from happening? Well, you know, this is an interesting thought experiment. Let's do a thought experiment, and I'll put you on the spot, since you've been asking me tough questions all afternoon. Okay, there's a switch and you can flip it, and, you know, this isn't really science fiction; we know this is possible, or most people agree it's possible. There's a switch you can flip, and an AGI is created to human specification, so you can create a Tim. Let's make it concrete: this Tim can produce text at the speed ChatGPT with GPT-4 does now, it can draw pictures at the speed of DALL-E 3, it can communicate almost instantly the way computers already do, and it remembers an enormous amount of facts. And the deal is that Tim 2.0 will be randomly assigned to a big tech company or, let's say, one of the nuclear powers; you spin the wheel and it goes to one of them, you don't know which. And the question is: do you flip the switch?
You flip that switch knowing that tomorrow they're going to spin up 100,000 Tims on their servers and make them do something, presumably to maximize shareholder value. What would you do? I think this is a great example of how we can't predict the future, and, you know, the road to hell is paved with good intentions; there are so many situations where we trust our moral intuitions, and even longtermists argue that they believe they can see the future, and in this particular case it could easily end up being a good thing. So you're saying you would flip it? Well, that hasn't really answered the question.
No, the honest answer is I don't know, but I guess this is a great example of where our moral intuitions are almost always wrong. Oh, right, so I don't know the answer. Well, the intuitive answer: most people would say no, don't do it; I think I would also say no. Yes, but what's your answer? I'm pretty ambivalent, to be honest. Yes, but you have to decide: the switch is there, you either flip it or you don't; being ambivalent means no. The only reason I wouldn't flip it, and you know it's a question of personal identity in philosophy, is that I don't want there to be too many other Tims out there.
If there were, if it were you, I would choose you, because there would be no psychological continuity with the other Tims. So the reason I think everyone who works in AI should think about this question is that if you work for a company that has AGI built into its core, if you work for an OpenAI or a DeepMind, who literally write that AGI is their mission, you're flipping that switch a little bit every day, and you should think about whether you want to. You have this big diffusion of responsibility: well, it's not me, there are thousands of people working on this. But you are nudging that switch towards the midpoint, to the point where it tips one way or the other, so you should ask yourself that question if you're actively working towards that goal, and if, like most people, your answer is no, maybe you should work for a different company.
Anyway, Professor Prince, it has been an absolute honour. It has been a real pleasure speaking with you; I have watched many of your podcasts, I have learned a lot from them, and I am looking forward to the ones that will come out in the future. Thank you very much for joining us today. Thank you.