Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108

Mar 30, 2024
The following is a conversation with Sergey Levine, a professor at Berkeley and a world-class researcher in deep learning, reinforcement learning, robotics, and computer vision, including the development of algorithms for end-to-end training of neural network policies that combine perception and control, scalable algorithms for inverse reinforcement learning, and, in general, deep RL algorithms. Quick summary of the sponsors: Cash App and ExpressVPN. Please consider supporting the podcast by downloading Cash App and using code LEXPODCAST, and by signing up at expressvpn.com/lexpod. Click the links, buy the stuff. It's the best way to support this podcast and, in general, the journey I'm on.
If you enjoy it, subscribe on YouTube, review it with five stars on Apple Podcasts, follow on Spotify, support it on Patreon, or connect with me on Twitter at lexfridman. As usual, I'll do a few minutes of ads now, and never any ads in the middle that can break the flow of the conversation. This show is presented by Cash App, the number one finance app in the App Store. When you get it, use code LEXPODCAST. Cash App lets you send money to friends, buy bitcoin, and invest in the stock market with as little as one dollar, since Cash App does fractional share trading.

Let me mention that the order execution algorithm that works behind the scenes to create the abstraction of fractional orders is an algorithmic marvel. So big props to the Cash App engineers for taking a step up the next layer of abstraction over the stock market, making trading more accessible for new investors and diversification much easier. So again, if you get Cash App from the App Store or Google Play and use the code LEXPODCAST, you get $10, and Cash App will also donate $10 to FIRST, an organization that is helping to advance robotics and STEM education for young people around the world. This show is also sponsored by ExpressVPN. Get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package.
I've been using ExpressVPN for many years. I love it. I think ExpressVPN is the best VPN out there. They did tell me to say that, but it happens to be true, in my humble opinion. It doesn't log your data, it's crazy fast, and it's easy to use, literally just one big power-on button. Again, it's probably obvious to you, but I have to say it: it's really important that they don't log your data. It works on Linux and all the other operating systems, but Linux, of course, is the best operating system. Shout out to my favorite flavor, Ubuntu MATE 20.04. Once again, get it at expressvpn.com/lexpod to support this podcast and to get an extra three months free on a one-year package. And now, here's my conversation with Sergey Levine.

What's the difference between a state-of-the-art human, like you and me, and a state-of-the-art robot? Well, I don't know if we qualify as state-of-the-art humans, but the capability of robots is something that I think is very complicated to understand, because there are some things that are hard that we wouldn't think are hard, and some things that are easy that we wouldn't think are easy. There's also a really big gap between the capabilities of robots in terms of hardware, their physical capability, and the capabilities of robots in terms of what they can do autonomously.
There's a little video that robotics researchers really like to show, especially robot learning researchers like me, from around 2004, from Stanford, that demonstrates a prototype robot called the PR1. The PR1 was a robot that was designed as a home assistance robot, and there's this beautiful video showing the PR1 tidying up a living room, putting away the toys, and at the end, bringing a beer to the person sitting on the couch, which looks really amazing. And then the punchline is that the whole thing is actually controlled by a person. So, in some ways, the gap between a state-of-the-art human and a state-of-the-art robot, if the robot has a human brain, is actually not that big. Now, obviously, human bodies are sophisticated and very robust and resilient in many ways, but on the whole, if we're willing to spend a little money and do a little engineering, we can almost close the hardware gap; but the intelligence gap is very wide. And when you say hardware, you mean the physical side, the actuators, the actual body of the robot, as opposed to the hardware on which the cognition runs, the hardware of the nervous system? Yes, exactly. I mean the body rather than the mind. And does that mean that kind of work is largely done for us? Well, while we can still improve the body, the big bottleneck right now is really the mind. And how big is that gap? How big is the difference, in your sense, in the ability to learn, the ability to reason, the ability to perceive the world, between humans and our best robots? The gap is very large, and the gap gets larger the more unexpected events can happen in the world. So essentially, the spectrum along which you can measure the size of that gap is the spectrum of how open the world is. If you control everything in the world very tightly, if you put the robot in a factory and tell it where everything is and rigidly program its motion, then it can do things, you know,
you could even say in a superhuman way: it can move faster, it's stronger, it can lift a car up, things like that. But as soon as anything starts to vary in the environment, it'll stumble, and if many things vary, like they would in your kitchen, for example, then things are pretty much wide open. Now, to linger on the philosophical questions a little bit: how much of the human side of cognitive abilities, in your sense, is nature versus nurture? How much of it is a product of evolution, and how much of it is learned from scratch, from the day we're born?
I'm going to read into your question as asking about the implications of this for AI, because I'm not a biologist, so I can't really speak with authority about the biology side. If it's about learning, then there's more hope for me. So the way I look at this is, first, you know, of course biology is very messy, and if you ask the question, how does a person do something, or how does a person's mind do something, you can come up with a lot of hypotheses, and you can often find support for many different, often contradictory, hypotheses. But one way we can approach the question of what the implications are for AI is to think about what's minimally sufficient.
Maybe a person is very good from birth at some things, like, for example, recognizing faces. There's a very strong evolutionary pressure to do that: if you can recognize your mother's face, you're more likely to survive, and therefore people are good at this. But we can also ask what's the minimal sufficient thing, and one of the ways we can study the minimal sufficient thing is to look, for example, at what people do in unusual situations, if they're presented with things that evolution couldn't have prepared them for. You know, our daily lives actually do this to us all the time: we didn't evolve to deal with, you know, cars and space flight and whatever. These are all situations we can find ourselves in, and we do very well there.
Say I give you a joystick to control a robotic arm you've never used before. You might be pretty bad at it for the first couple of seconds, but if I tell you that, like, your life depends on using this robotic arm to open this door, you'll probably manage it, even though you've never seen this device before and have never used the joystick controls — somehow you'll succeed. And that's not your natural evolved ability; that's your flexibility, your adaptability. And that's exactly where our current robotic systems really fall flat. But I wonder how general that is — how much of what we consider common sense there is in pre-trained models underneath all of that. The ability to adapt to a joystick seems to require a lot of prior knowledge. You know, I'm human, so it's hard for me to introspect on all the knowledge I have about the world, but it seems like there might be an iceberg underneath. How much knowledge do you actually bring to the table?
Now, that's an open question. There's absolutely an iceberg of knowledge that we bring to the table, but I think it's very likely that that iceberg of knowledge is actually built up over the course of our lifetimes, because we have a lot of prior experience to draw on, and it makes sense that the right way for us to optimize our efficiency, our evolutionary fitness and so on, is to use all of that experience to build the best iceberg we can get. And actually, that sounds an awful lot like what machine learning actually does. I think for modern machine learning, it's actually a really big challenge to take this massive, unstructured experience and extract from it something that looks like a common-sense understanding of the world. And maybe part of why that's hard is not because something about machine learning itself is broken, but because we've been a little too rigid in subscribing to a very supervised notion of learning — you know, this input-output, X goes to Y, sort of model. And maybe what we really need to do is view the world more as a mass of experience that isn't necessarily providing any rigid supervision, but is providing many, many instances of things that could be, and then you take that and distill it into some sort of common-sense understanding.
I see. So you're painting an optimistic, beautiful picture, especially from the robotics perspective, because that means we just need to invest in better learning algorithms, figure out how we can get access to more and more data for those learning algorithms to extract signal from, and then accumulate that iceberg of knowledge. It's a beautiful picture; it's hopeful. I think it's potentially a little bit more than that, and this is where we maybe reach the limits of our current understanding. But one thing that I think the research community hasn't really resolved in a satisfactory way is how much it matters where that experience comes from. Like, do you just download everything on the internet, essentially train the 21st-century analogue of the giant language model, and see what happens? Or does it really matter whether your machine physically experiences the world, in the sense that it actually attempts things, observes the outcomes of its actions, and kind of augments its experience that way, by choosing which parts of the world it gets to interact with, observe, and learn from? It may be that the world is so complex that simply obtaining a large mass of IID samples of the world is a very difficult way to go, whereas if you're actually interacting with the world and essentially performing this sort of hard negative mining, by trying what you think might work, observing the sometimes happy and sometimes sad outcomes of that, and augmenting your understanding using that experience, and you're doing this continually over many years, maybe that sort of data collection is, in some sense, actually much more conducive to getting a common-sense understanding.
Well, one reason we might think that's true is that what we associate with common sense, or its absence, is often characterized by the ability to reason about counterfactual questions. Like, you know, if I were to tell you: here I have this bottle of water sitting on the table, everything is fine — now, what if I knock it over? Which I'm not going to do, but if I were to do that, what would happen? And I know that nothing good would come of it, but if I had a bad understanding of the world, I might actually think that's a great way to, like, gain more utility. And if I go about my daily life doing the things that my current understanding of the world suggests will give me high utility, I'll get exactly the right supervision to tell me not to do those bad things and to keep doing the good things. So there's a spectrum between IID samples, a random walk through the data space, and then whatever it is we humans do — and I don't even know if we do it optimally; there could be something beyond that. On this open question that you've raised: where do you think intelligent systems could fall on that spectrum to be able to deal with this deluge of the world? Can we do pretty well by reading all of Wikipedia, sampled randomly the way language models do, or do we have to be exceptionally selective and smart about which aspects of the world we take in? So I think, first off,
it's an open scientific problem, and I don't have a clear answer, but I can speculate a little bit. And what I would speculate is that you don't need to be super careful. I think it's less about being careful to avoid the useless things, and more about making sure that you hit the really important things. So maybe it's okay if you spend part of your day just guided by your curiosity, visiting interesting regions of your state space, but it's important that, every once in a while, you make sure that you actually try out the solutions that your current model of the world suggests might be effective, and you observe whether those solutions work as you expect them to or not. And maybe some of that is actually essential for having a sort of perpetual improvement loop. That perpetual improvement loop is really the key — the thing that will potentially distinguish the best current methods from the best methods of tomorrow, in a sense. How important do you think is exploration, or almost out-of-the-box thinking, in this space — jumping into a totally different domain? You mentioned it's an optimization problem, and you explore within the details of a particular strategy, whatever you're trying to solve. How important is it to explore totally outside of the strategies that have been working for you so far? What's your intuition there? Yeah, I think it's a very problem-dependent kind of question, and I think, actually, you know, that question kind of gets at one of the big differences between the classical formulation of the reinforcement learning problem and some of the more open-ended reformulations of that problem that have been explored in recent years.
Classically, reinforcement learning is posed as the problem of maximizing utility, like any kind of rational AI agent, and then anything you do is in service of maximizing that utility. But an interesting alternative way of looking at these problems — and needless to say, I'm not claiming it's the best way — is one where you first get to explore the world however you please, and then afterwards you'll be tasked with doing something, and that might suggest somewhat different solutions. So if you don't know what you're going to be tasked with, and you just want to prepare yourself optimally for whatever your uncertain future holds, then maybe you'll choose to attain some sort of coverage — to build up a sort of arsenal of cognitive tools, if you will — such that later on, when somebody tells you, now your job is to fetch the coffee for me, you'll be well prepared to undertake that task. And do you see that as the modern formulation of the reinforcement learning problem, as the multi-task, general-intelligence kind of formulation? I think that's one possible vision of where things might be headed.
I don't think that's by any means the mainstream or standard way of doing things right now — but I like it; it's a beautiful vision. So maybe let's take a step back: what is the goal of robotics? What is the general problem of robotics we're trying to solve? You actually painted two pictures here, one narrow, one general. In your view, what is the big problem of robotics? Again, ridiculously philosophical questions. Yeah, I think, you know, maybe there are two ways I can answer this question. One is, there's a very pragmatic problem, which is what would make robots — what would maximize the usefulness of robots — and there, the answer might be something like a system that can perform any task that a human user sets for it, within physical constraints, of course. If you tell it to teleport to another planet, it probably can't do that, but if you ask it to do something that's within its physical capability, then potentially, with a little bit of additional training or a little bit of extra trial and error, it should be able to figure it out, in much the same way that a human teleoperator would figure out how to drive the robot to do that. That's a very pragmatic view of what it would take to solve the robotics problem, so to speak. But I think there's a second answer, and that answer has a lot more to do with why I want to work on robotics: I think it's less about what it would take for robots to do useful work in the world, and more the other way around — what robotics can contribute to help us understand artificial intelligence. So your dream, fundamentally, is to understand intelligence?
Yeah, I think that's the dream for many people who actually work in this space. I think there's something very pragmatic and very useful about studying robotics, but I do think that for a lot of people who go into this field, the thing they draw inspiration from is the potential for robots to help us learn about intelligence and about ourselves. It's fascinating that robotics is basically the space by which you can get closest to understanding fundamental artificial intelligence. So what is it about robotics that sets it apart from some of the other approaches?
If you look at some of the early advances in deep learning, in the computer vision and natural language processing spaces, there were really nice, clean benchmarks that a lot of people competed on, and a lot of ideas were built out of that. What, to you, is the fundamental difference between computer vision, purely defined as ImageNet-style problems, and the bigger problem of robotics? So there are a couple of things. One is that with robotics, you have to take away a lot of the crutches. You have to deal with both the particular problems of perception, control, and so on, but you also have to deal with the integration of those things. And, you know, classically we've always thought of the integration as kind of a separate problem — a sort of modular engineering approach, where we solve the individual subproblems, then wire them together, and then the whole thing works. And one of the things we've been seeing over the last couple of decades is that, well, maybe studying the thing as a whole leads to very different solutions than if we study the parts and wire them together. So the integrative nature of robotics research helps us see different perspectives on the problem. The other part of the answer
is that robotics brings a certain paradox into sharp relief — this is sometimes referred to as Moravec's paradox: the idea that in artificial intelligence, things that are very hard for people can be very easy for machines, and vice versa, things that are very easy for people can be very hard for machines. So, you know, integral and differential calculus is pretty difficult for people to learn, but if you program a computer to do it, it can compute derivatives and integrals all day long without any trouble, whereas something like, you know, drinking from a cup of water is very easy for a person and very difficult for a robot. And sometimes when we see such blatant discrepancies, they give us a very strong hint that we're missing something important. So if we really try to zero in on those discrepancies, we might find that little bit that we're missing — and it's not that we need to make machines worse at math and better at drinking water, but that by studying those discrepancies, we might find some new insight.
So that could be in any space; it doesn't have to be robotics. But, yeah, I understand — it's interesting that robotics seems to have a lot of those discrepancies. So Moravec's paradox — Hans Moravec — is probably referring to the space of physical interaction: I think you said object manipulation, walking, all the kinds of things we do in the physical world. How do you make sense of it, if you were to try to disentangle Moravec's paradox — why is there such a gap in our intuition about this? Why do you think manipulating objects is so hard, from everything you've learned from applying reinforcement learning in this space?
Yeah, I think one reason is, maybe, that for many of the other problems we've studied in AI and computer science and so on, the notion of input, output, and supervision is much, much cleaner. Computer vision, for example, deals with very complex inputs, but it's comparatively a bit easier, at least up to some level of abstraction, to cast it as a very tightly supervised problem. It's comparatively much harder to cast robotic manipulation as a very tightly supervised problem. You can do it; it just doesn't seem to work all that well. So you could say, well, maybe we can get a labeled dataset where we know exactly which motor commands to send, and then we train on that — but for a variety of reasons, that's not actually such a great solution, and it also doesn't seem to be even remotely similar to how people and animals learn to do things, because we're not told by, like, our parents, here's how you fire your muscles in order to walk.
You know, we get some guidance, but the really low-level detailed stuff we figure out mostly on our own. And that's what you mean by tightly supervised: each little sub-action gets a supervision signal of whether it's good or not. So in computer vision, you could imagine, up to a level of abstraction, that maybe, you know, somebody told you, this is a car, and this is a cat, and this is a dog; in motor control, it's very clear that that was not the case. If we look at the sub-spaces of robotics — again, as you said, robotics integrates them all, and we'll see how this beautiful mess connects — there's nevertheless perception, which is the computer vision problem, broadly speaking: understanding the environment. Then there's also — and maybe you can correct me on this kind of categorization of the space — prediction, trying to anticipate what things are going to do in the future, so that you can act in that world. And then there's also this game-theoretic aspect of how your actions will change the behavior of others. In this kind of space — and this is bigger than reinforcement learning —
just looking broadly at the robotics problem: what's the hardest problem here? Or is it that individually they're harder than all of them together — that you should just look at them all together? I think when you look at them all together, some things actually become easier, and I think that's actually pretty important. So, you know, back in 2014 we had some work that was basically our first work on end-to-end reinforcement learning for robotic manipulation skills from vision, and at the time it was something that seemed a little inflammatory and controversial in the robotics world. But, the inflammatory and controversial part aside, the point we were really trying to make in that work is that, for the particular case of combining perception and control, you could actually do better if you treat them together than if you try to separate them. And the way we tried to demonstrate it is, we picked a fairly simple motor control task, where a robot had to insert a little red trapezoid into a trapezoidal hole, and we had a separated solution, which involved first detecting the hole using a pose detector and then actuating the arm to put it in, and then our end-to-end solution, which just mapped pixels to torques. And one of the things we observed is that if you use the end-to-end solution, essentially the burden on the perception part of the model is actually lower: it doesn't have to figure out exactly where the thing is in 3D space; it just has to figure out where it is in a way that distributes the errors such that errors along the horizontal direction matter more than errors along the vertical direction, because vertically it just pushes down until it can't go any further, so perceptual errors there are a lot less harmful, whereas perceptual errors perpendicular to the direction of motion are much more harmful. So the point is that if you combine these two things, you can trade off the errors between the components optimally so as to best accomplish the task, and the components can be weaker while still leading to better overall performance. That's a deep idea. I mean, in the space of pegs and things like that it's kind of simple — it's almost tempting to overlook it — but that seems to be, at least intuitively, an idea that should generalize to basically all aspects of perception and control, where one can strengthen the other. Yes, and, you know, people who have studied perceptual heuristics in humans and animals find things like that all the time. One very well-known example is something called the gaze heuristic, which is a little trick that you can use to intercept a flying object. So if you want to catch a ball, for example, you could try to localize it in 3D space, estimate its velocity, estimate the effect of air resistance, solve a complex system of differential equations in your head,
or you can just maintain a running speed such that the object stays in the same position in your field of view. If it dips a little bit, you speed up; if it rises a little bit, you slow down. And if you follow this simple rule, you'll actually arrive at exactly the place where the object lands, and you'll catch it. Humans use this when they catch balls; human pilots use it when they fly airplanes to figure out whether they're on a collision course with somebody; frogs use this to catch insects, and so on. So this is something that actually happens in nature, and I'm sure it's just one instance that we've been able to identify, simply because perception scientists were able to study it — there are probably many others.
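As an illustration of how such a feedback rule can stand in for explicit modeling, here is a minimal sketch of the gaze heuristic as a control loop. The function names, the gain value, and the toy angle computation are my own illustrative assumptions, not anything from the conversation; the only idea taken from it is "adjust your speed so the target stays fixed in your field of view."

```python
import math

def gaze_heuristic_step(angle, prev_angle, speed, gain=5.0):
    """One step of the gaze heuristic: keep the gaze angle to the ball
    constant. If the ball rises in the field of view (angle increasing),
    slow down; if it drops, speed up. No 3D localization, velocity
    estimation, or differential equations required."""
    angle_rate = angle - prev_angle
    return max(0.0, speed - gain * angle_rate)

def gaze_angle(ball_xy, runner_x):
    """Toy stand-in for the perceptual input: elevation angle to the ball.
    In practice this reading would come straight from vision."""
    dx = ball_xy[0] - runner_x
    return math.atan2(ball_xy[1], max(dx, 1e-6))
```

The point mirrors the end-to-end argument above: the perceptual signal only has to be good enough for the control rule that consumes it, not a full 3D reconstruction of the scene.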
While we're on the topic of robotics: is there a canonical problem — a kind of simple, clean, beautiful, representative problem in robotics — that you think about when you think about some of these questions? With robotic manipulation, to me, it intuitively seems that at least the robotics community is converging on manipulation as that kind of space, as the canonical problem. If you agree, then maybe you can zoom in on some particular aspect of that problem that you like — where, if we solve that problem perfectly, it would unlock a major step toward human-level intelligence. I don't think I have a great answer to that, and I think part of the reason I don't have a great answer has to do with the fact that the difficulty really lies in the flexibility and adaptability, rather than in doing one particular thing really well. So it's hard to just say, oh, if you can — I don't know — shuffle a deck of cards as fast as a Vegas casino dealer, then you'll be very proficient. It's really the ability to quickly figure out how to do some new, arbitrary thing well enough to move on to the next arbitrary thing.
If you are a casino dealer, then you will be very competent. In reality, it's the ability to quickly figure out how to do something new and arbitrary well enough to know how to move on to the next arbitrary thing. But the source of novelty and uncertainty, have you encountered problems where it's easy to generate new noonah messes? , yes, new types of novelty, yes, a few years ago, if you had asked me this question in 2016, maybe I would have done it? I probably said that robotic grasping is a great example of that because it is a task with a lot of utility in the real world as you will get a lot of money if you can do it well, when does robotic grasping pick up any object with a robotic hand exactly?
So you'd make a lot of money if you do it well, because lots of people want to run warehouses with robots, and it's not trivial, because very different objects will require very different grasping strategies. But actually, since then, people have gotten very good at building systems that solve this problem, to the point where I'm not actually sure how much more progress we can get with that as the guiding problem. It is, though, interesting to look at the kinds of methods that actually worked well in that space, because robotic grasping classically used to be thought of almost as a geometry problem. People who've studied the history of computer vision will find this very familiar: in the same way that in the early days of computer vision, people thought of it as kind of an inverse-graphics problem, in robotic grasping, people thought of it as essentially an inverse-physics problem. You look at what's in front of you, figure out the shapes, and then use your best understanding of the laws of physics to figure out where to put your fingers, and you pick the thing up. And it turns out that what works really well for robotic grasping — instantiated in many different recent works, including ours but also those of many other labs — is to use learning methods with some combination of either extensive simulation or real-world trial and error, and those things actually work really well, and then you don't have to worry about exactly solving geometry problems or physics problems. So, just by the way, in grasping, what are the difficulties that had to be worked out? Is it, like, the materials of things, maybe the occlusions on the perception side — why is it so hard? Why is picking things up such a hard problem? Yeah, it's a hard problem because the variety of things you might have to deal with is extremely large, and oftentimes things that work for one class of objects won't work for another class of objects. So if you get really good at picking up boxes, and now you have to pick up plastic bags, you know, you might need to employ a very different strategy. And there are many properties of objects that are more than just their geometry. It has to do with, you know, the parts that are easier to pick up and the parts that are harder to pick up, the parts that are more flexible, the parts that will cause the thing to pivot and bend and drop out of your hand versus the parts that result in a secure grasp; things that are deformable; things that, if you pick them up the wrong way, will flip over and the contents will spill out. So all these little details come up, but the task can still be characterized as one task: there's a very clear notion of whether you did it or you didn't. In terms of spilling things, this notion starts to creep in that sounds like it needs common-sense reasoning.
Do you think solving the general robotics problem requires common-sense reasoning, requires general intelligence — this kind of human-level capability of, as you said, being robust and dealing with uncertainty, but also being able to reason and assimilate the different pieces of knowledge that you have? What are your thoughts on the need for common-sense reasoning in the space of the general robotics problem? So I'm going to sidestep that question a little bit and instead say that I think maybe it's actually the other way around: studying robotics can help us understand how to put common sense into our AI systems.
One way to think about common sense — and why our current systems might lack common sense — is that common sense is an emergent property of actually having to interact with a particular world, a particular universe, and get things done in that universe. So you might think that, for instance, an image-captioning system — well, it looks at pictures of our world and writes English sentences about them, so it should deal with our world. And yet you can easily construct situations where image-captioning systems do things that defy common sense, like you show it a picture of a person wearing a fur coat and it says it's a teddy bear. I think what's really happening in those settings is that the system doesn't actually live in our world; it lives in its own world, which consists of pixels and English sentences, and it doesn't ever actually have to, you know, put on a fur coat in the winter to keep out the cold. So maybe the reason for the disconnect is that the systems we have now simply inhabit a different universe, and if we build AI systems that are forced to deal with all of the messiness and complexity of our universe, maybe they will have to acquire common sense, essentially to maximize their utility, whereas the systems we're building now don't have to do that; they can take some shortcuts.
That's fascinating. You've now reframed the role of robotics in this whole thing a couple of times, and — I don't know if my way of thinking is the common one — but I thought we need to understand and solve intelligence in order to solve robotics, and you're framing it the other way: robotics is one of the best ways to just study artificial intelligence — like, the right space in which to explore some of the fundamental learning mechanisms, the fundamental multimodal, multi-task knowledge-aggregation mechanisms that are required for general intelligence.
A really interesting way to think about it. But let me ask about learning: can the general kind of robotics — the epitome of the robotics problem — be solved purely through learning, maybe end-to-end learning, learning from scratch, as opposed to injecting human expertise, rules, heuristics, and so on? I think in terms of the spirit of the question, I would say yes. I mean, in some ways that might be too stark a dichotomy, because, you know, when we build algorithms, at some point a person does something — like, there's always a person that turned on the computer, a person that implemented TensorFlow — but yeah, in terms of the point you're making, I think the answer is yes.
I think we can solve many of the problems that previously required meticulous manual engineering through automated optimization techniques instead. And actually, I'll say one thing about that: I don't think this is actually a very radical or very new idea. I think people have been thinking about automated optimization techniques as a way to do control for a very long time, and in some ways what's changed is really more a matter of degree. So, you know, today we might say, oh, my robot does machine learning, it does reinforcement learning; in the 1960s you'd say, oh, my robot is doing optimal control. And maybe the difference between writing down a system of differential equations and doing feedback linearization, versus training a neural network, isn't such a big difference — it's just pushing the optimizer deeper and deeper into the thing. But especially with deep learning, there's this accumulation of experience in the form of data, forming deep representations, and that starts to feel like knowledge, as opposed to optimal control — it feels like there's an accumulation of knowledge through the learning process. Yeah, I think that's a good point: a big difference between learning-based systems and classical optimal control systems is that learning-based systems, in principle, should get better and better the more they do something, and I think that's actually a very, very powerful difference. So if we look back at the world of expert systems, symbolic AI and so on — the use of logic to accumulate expertise, human-encoded expertise — do you think that will play a role at some point? Deep learning, machine learning, reinforcement learning have had incredible results and breakthroughs that have inspired thousands, maybe millions, of researchers, but, you know, there's this idea of symbolic AI that's less popular now but used to be dominant. Do you think it will play a role?
I think, in some ways, the sort of descendants of symbolic AI actually already have a role. So, you know, this is the highly biased history from my perspective. You say, well, initially we thought that rational decision-making involves logical manipulation: you have some model of the world expressed in terms of logic, you have some query, like, what action should I take in order for X to be true, and you manipulate your logical, symbolic representation to get an answer. Then people said, well, the world is uncertain, so instead of manipulating logical statements that have true or false values, you build probabilistic systems, where things have probabilities associated with them — probabilities of being true or false — and now you're spinning up Bayes nets, and that provides a kind of impetus for what are still essentially logical inference systems, just probabilistic logical inference systems. And then people said, well, let's learn the individual probabilities inside these models; and then people said, well, let's not even specify the nodes in the models, let's just put a big neural network in there. But in many ways, I see these as descendants of the same idea: essentially, it's about instantiating rational decision-making by means of some inference process, learned through an optimization process. So, in a sense, I would say yes, it has a place, and in many ways it already occupies that place — it's already there; it just looks slightly different from how it used to. But in some cases there are a few things we can think about that make this a little more obvious. Like, if I train a big neural network model to predict what will happen in response to my robot's actions, and then I run probabilistic inference — meaning I invert that model to figure out the actions that lead to some plausible outcome — that looks a lot like logic to me: you have a model of the world, it just happens to be expressed by a neural network, and you're doing some sort of inference procedure, some sort of manipulation of that model, to figure out, you know, the answer to a query that you have. What about the interpretability, though, the explainability? That seems to be missing even more, because the nice thing about expert systems is that you can follow the reasoning of the system in a way that, to us mere humans, is somehow compelling. I just don't know what to make of this fact — there's a human desire for intelligent systems to be able to convey, in a poetic way, why they made the decisions they made, like telling a compelling story. And maybe that's a silly human thing; maybe we shouldn't expect that from intelligent systems — maybe we should just be super happy that there are intelligent systems out there. But if I were to psychoanalyze the researchers on this point, I'd say that expert systems connected to that desire of AI researchers for systems to be explainable.
I mean, maybe on that topic: do you have a hope that the kinds of inferences produced by learning-based systems can be explainable in the way that was the dream with expert systems? I think it's a very complicated question, because in some ways the question of explainability is very closely tied to the question of performance. Like, why do you want your system to be explainable? So that when it screws up, you can figure out why it did it. Right — but in some ways that's a much bigger problem, actually. Your system might screw up, and then it might also screw up in the way it explains itself, or you might have a bug somewhere, so that it's not actually doing what it was supposed to do. So maybe a good way to view that problem is really as a bigger problem of verification and validation, of which explainability is sort of one component.
I see. I just see it differently — you put it beautifully, and I think you really summarized the field of explainability, but to me there's another aspect of explainability, which is like storytelling, that has nothing to do with errors or bugs — or rather, it uses errors as elements of its story rather than arising from a fundamental need to be explainable when errors occur. It's just that, for intelligent systems to be in our world, we seem to want to tell each other stories, and that's true in the political world, it's true in academia. And, you know, neural networks are less capable of doing that — or maybe they're equally capable of storytelling; maybe it doesn't matter what the fundamentals of the system are, you just need to be a good storyteller. Well, maybe one specific story I can tell you in that space is about some work that was done by my former collaborator, who's now a professor at MIT, named Jacob Andreas.
Jacob actually works in natural language processing, but he had this idea to do a little bit of work in reinforcement learning, on how natural language can basically structure the internals of policies trained with RL. One of the things he did was set up a model that attempts to perform some task that's defined by a reward function, but the model reads in a natural-language instruction. This is a pretty common thing to do in instruction following: you say, like, go to the red house, and then it's supposed to go to the red house. But one of the things Jacob did is he treated that sentence not as a command from a person, but as a representation of the internal state of mind of this policy, essentially, so that when it was faced with a new task, what it would do is basically try to think up possible language descriptions, attempt to do them, and see if they led to the right outcome. So it would kind of think out loud, like, you know: I'm faced with this new task — what am I going to do? Let me go to the red house. Nope, that didn't work. Let me go to the blue room, or something. Let me go to the green plant. And once it got some reward, it would say: oh, go to the green plant, that's what's working, I'm going to go to the green plant. And then you could look at the string that it came up with, and that was a description of how it thought it should solve the problem. So you could basically pull in language as internal state, and you can start to imagine handling this kind of thing. Then what I was trying to get at is: what if you also add, to the reward function, the compellingness of the story? Then you have another reward signal, of people reviewing that story, how much they like it. Initially, you know, it could be a hyperparameter, some kind of hard-coded heuristic type of thing, but it's an interesting notion that the compellingness of the story becomes part of the reward function, the objective function, of explainability. And in the world of, like, Twitter and fake news, that could be a scary notion — that the nature of the truth may matter less than how compelling your story around the facts is. Well, let me ask the basic question: you're one of the world-class researchers in deep reinforcement learning, certainly in the robotics space — what is reinforcement learning?
I think what the term refers to today — reinforcement learning is really just the modern incarnation of learning-based control. Classically, reinforcement learning has a much narrower definition, which is, literally, learning from reinforcement: like, the thing does something and then gets a reward or a punishment. But really, I think the way the term is used today is for learning-based control more broadly: some kind of system that's supposed to be controlling something, and it uses data to get better. And what does control mean — is action the fundamental element? Yes, it means making rational decisions — and rational decisions are decisions that maximize some measure of utility. And sequentially — you make those decisions over and over and over again. Now, it's easier to see that kind of idea in the space of, maybe, games, and in the robotics space — is it bigger than that? Like, what are the limits of the applicability of reinforcement learning? Yeah, so rational decision-making is essentially the encapsulation of the AI problem, viewed through a particular lens, so any problem that we'd want an intelligent machine to do can probably be represented as a decision-making problem.
Classifying images is a decision-making problem, although not a sequential one, typically. You know, controlling a chemical plant is a decision-making problem. Deciding which videos to recommend on YouTube is a decision-making problem. And one of the really appealing things about reinforcement learning is that, if it can encapsulate the range of all these decision-making problems, then maybe working on reinforcement learning is, you know, one of the ways to tackle a very broad swath of AI problems. But what's the difference between reinforcement learning and maybe supervised machine learning? So reinforcement learning can be viewed as a generalization of supervised machine learning. You can certainly cast supervised learning as a reinforcement learning problem: you can just say your loss function is the negative of your reward. But you have stronger assumptions: you assume that somebody actually told you what the right answer was, that your data was IID, and so on. So you can view reinforcement learning as essentially relaxing some of those assumptions. Now, that's not always a very productive way to look at it, because if you actually have a supervised learning problem, you'll probably solve it much more effectively using supervised learning methods, because it's easier. But you can view reinforcement learning as a generalization of that.
For sure — that's a mathematical statement that's absolutely correct — but it seems like the tools we bring to the table today are different. So maybe in the future everything will be framed as a reinforcement learning problem, just like you said image classification can be mapped onto a reinforcement learning problem, but today the tools and the ways we think about them are different. Supervised learning has been used very effectively to solve basic, narrow AI problems; reinforcement learning represents, in a sense, the dream of AI. It's much more in the research space now, in terms of captivating people's imagination about what we can do with intelligent systems, but it hasn't yet had the wide impact that supervised learning approaches have had. So my question comes from a more practical sense: what do you see as the gap between the more general reinforcement learning and the very specific — yes, it's a decision-making problem, but just one step of the sequence — kind of supervised learning? So from a practical standpoint, I think one thing that is potentially a bit of a challenge now — and I think it's a gap we might see closing over the next couple of years —
is the ability of reinforcement learning algorithms to effectively utilize large amounts of prior data. So one of the reasons it's a bit difficult today to use reinforcement learning for all the things we might want to use it for is that, in most of the settings where we want to do rational decision-making, it's a little expensive to just deploy some policy that does crazy things and learns purely through trial and error. It's much easier to collect a lot of data, a lot of logs, from some existing policy that you've got, and then maybe, you know, if you can get a good policy out of that, you deploy it and let it fine-tune a little bit. But algorithmically, it's quite difficult to do that. So I think that once we figure out how to get reinforcement learning to effectively bootstrap from large datasets, then we'll see very, very rapid growth in applications of these technologies. So this is what's referred to as off-policy reinforcement learning, or offline RL, or batch RL, and I think we're seeing a lot of research right now that's bringing us closer and closer to that.
Can you maybe paint a picture of the different methods? You said off-policy — what's value-based reinforcement learning, what's policy-based, what's model-based reinforcement learning? Yeah, so one way we can think about reinforcement learning is that it is, in some very fundamental way, about learning models that can answer "what if" kinds of questions: what would happen if I take this action that I hadn't taken before? And it does that, of course, from experience, from data. And oftentimes it does it in a loop: build a model that answers these what-if questions, use it to figure out the best action you can take, then go and try taking that action and see whether the outcome agrees with what you predicted. So the different kinds of techniques basically refer to different ways of doing this. Model-based methods answer the question of what state you would end up in — basically, what would happen to the world — if you were to take a certain action.
Value-based methods answer the question of what value you would get — meaning, what utility you would get — if you were to take a certain action. But in a sense, they're not that different, because they're both answering these what-if questions. Now, unfortunately for us, with current machine learning methods, answering what-if questions can be really hard, because they're questions about things that didn't happen. If you only wanted to answer questions about things that did happen, you wouldn't need to learn a model: you would just, like, repeat the thing that worked before. And that's actually a really big part of why RL is kind of hard. So if you have a purely on-policy, online process, then you ask these what-if questions, you make some mistakes, you go and try doing those mistaken things, and then you see the counterexamples that will teach you not to do those things again. If you have a bunch of off-policy data and you just want to synthesize the best policy you can out of that data, then you really have to deal with the challenges of making these counterfactual predictions. What's a policy?
A policy is a model, or some kind of function, that maps from observations of the world to actions. In reinforcement learning, we often refer to the current configuration of the world as the state — we say the state encompasses everything needed to fully define where the world is at right now — and, depending on how we formulate the problem, we might say you either get to see the state, or you get to see an observation, which is a snapshot of some part of the state. So the policy is the thing that, given all of that, decides how to act in this world. Yes.
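To pin down the terminology, here is a minimal sketch of the three objects just discussed: a policy mapping observations to actions, and the two kinds of "what if" answerers — a value function and a dynamics model. The class and method names are illustrative assumptions, not any particular library's API.

```python
class Policy:
    """Maps an observation (a snapshot of part of the state) to an action."""
    def act(self, observation):
        raise NotImplementedError

class QFunction:
    """Value-based what-if: what utility would I get if I took
    this action in this state?"""
    def value(self, state, action):
        raise NotImplementedError

class DynamicsModel:
    """Model-based what-if: what state would the world end up in
    if I took this action in this state?"""
    def next_state(self, state, action):
        raise NotImplementedError
```

A greedy policy could be derived from either answerer: pick the action whose predicted utility, or whose predicted next state's score, is highest.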
And then what does off-policy mean? So the terms on-policy and off-policy refer to how you got your data. If you got your data from somebody else — maybe you got your data from some manually programmed system that was, you know, running in the world before — that's referred to as off-policy data. But if you got the data by acting in the world based on what your current policy thinks is good, we call that on-policy data. On-policy data is obviously more useful to you, because if your current policy makes some bad decisions, you will actually see that those decisions are bad from the on-policy data. But off-policy data might be much easier to obtain, because maybe that's, you know, all the logged data that you had from before. So we talked offline about autonomous vehicles — you can imagine off-policy kinds of approaches in those robotics spaces where there are actually a lot of robots out in the world, but they don't have the luxury of being able to explore based on a reinforcement learning framework. So, again an open question, but how do we make off-policy methods work?
Yeah, this has been kind of a big open problem for a while, and in the last few years people have made some progress on that. It's by no means solved yet, but I can tell you about some of the things that, for example, we've done to try to tackle some of the challenges. It turns out that one really big challenge with off-policy reinforcement learning is that you can't really trust your models to give accurate predictions for any possible action. So if my data never saw anybody taking the car off the road onto the sidewalk, my value function or my model is probably not going to predict the right thing if I ask what would happen if I were to take the car off the road onto the sidewalk. So one of the important things you have to do to get off-policy RL to work is to be able to figure out whether a given action will result in a trustworthy prediction or not, and you can use distribution estimation methods — density estimation methods — to try to figure that out. So you can figure out: okay, this action — my model is telling me it's great, but it looks totally different from any action I've taken before, so my model is probably not correct about it. And you can incorporate regularization terms into your learning objective that will essentially tell you not to ask those questions that your model can't answer.
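Here is a minimal sketch of that idea: score candidate actions by a learned Q-value, but penalize actions that a density model fitted to the logged data considers out-of-distribution. It's a generic sketch in the spirit of behavior-regularized offline RL, with made-up names, thresholds, and weights — not the specific algorithm from Levine's lab.

```python
import numpy as np

def regularized_score(q_value, behavior_log_density,
                      log_density_floor=-4.0, penalty_weight=10.0):
    """Trust the learned Q-value only where the logged data has support.
    behavior_log_density is assumed to come from a density model (e.g. a
    Gaussian mixture or VAE) fit to the actions seen in the dataset."""
    ood_gap = max(0.0, log_density_floor - behavior_log_density)
    return q_value - penalty_weight * ood_gap  # penalize unfamiliar actions

def select_action(candidate_actions, q_fn, log_density_fn):
    """Pick the best candidate according to the regularized score."""
    scores = [regularized_score(q_fn(a), log_density_fn(a))
              for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]
```

In a full method, the same penalty would also appear in the training objective, so the Q-function is never trained on questions the data cannot answer.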
What would lead to breakthroughs in this space? Do you think it's a dataset question — we need to collect large benchmark datasets that allow us to explore the space? Is it new kinds of methodologies? What's your sense — or maybe the robotics community coming together around one problem definition and working on it? I think for off-policy reinforcement learning in particular, it's very much an algorithms question right now. And, you know, that's something I think is great, because it means it just requires some very smart people to get together and think about it really hard, whereas if it were a data problem or a hardware problem, that would require a lot of engineering. So that's why I'm pretty excited about that problem, because I think we're in a position to make real progress on it just by coming up with the right algorithms. In terms of what those algorithms might be — the core difficulties there are very closely related to problems like causal inference, because you're really dealing with situations where you have a statistical model that's trying to make predictions about things it hasn't seen before. If it's a model that generalizes properly, it'll make good predictions; if it's a model that picks up on spurious correlations, it won't generalize properly. And then you have an arsenal of tools you can bring to bear: you could, for example, figure out which regions the model is trustworthy in, or, on the other hand, you could try to make it generalize better somehow, or some combination of the two.
Is there room for mixing the two? Like, most of it — 90, 95 percent — is off-policy, you've already got the dataset, and then you send the robot out to do a little bit of exploration. What's the role of mixing them together? Yes, absolutely. I think this is something that, actually, you described very well at the beginning of our discussion when you talked about the iceberg: 99 percent of your prior experience — that's your iceberg — and you use that for off-policy reinforcement learning. And then, of course, if you've never opened that particular kind of door with that particular lock before, you have to go out and fiddle with it a little bit, and that's that extra 1 percent to help you figure out the new task. And I think that's actually a pretty good recipe going forward.
Is this the most exciting space of reinforcement learning for you right now? Or, maybe an idealized question: what, to you, is the most beautiful idea or concept in reinforcement learning, in general? Actually, I think one of the things that is a very beautiful idea in reinforcement learning is just the idea that you can obtain a near-optimal controller — a near-optimal policy — without actually having a complete model of the world. This might seem like an obvious thing if you just hear the term reinforcement learning or think about trial-and-error learning, but from a controls perspective it's a very strange thing, because classically, you know, we think of engineered systems and the control of engineered systems as the problem of writing down some equations and then solving them — basically, solve for x, figure out the thing that maximizes performance. And reinforcement learning theory actually gives us a mathematically principled framework to reason about optimizing some quantity when you don't actually know the equations that govern that system. And that, to me, seems like a very elegant thing — not something that becomes immediately obvious, at least in the mathematical sense. Does it make sense to you that it works as well as it does?
Does it make sense to you that it works as well as it does? I think it makes sense once you take some time to think about it, but it is a little surprising. And then you take a step further into deep representations, which is also very surprising because of the richness of the state spaces and environments this kind of approach can handle. Could you say what deep reinforcement learning is? Deep reinforcement learning simply refers to taking reinforcement learning algorithms and combining them with high-capacity neural network representations. At first that might seem like a pretty arbitrary thing — just take these two components and put them together — but the reason it has become so important in recent years is that reinforcement learning faces an exacerbated version of a problem many other areas of machine learning have also faced. If we go back to the early 2000s or the late '90s, we see a lot of machine learning methods with very appealing mathematical properties, like reducing to convex optimization problems, but they require very special inputs: a representation of the input that is clean in some way — clean in the sense that, say, the classes in your multi-class classification problem are linearly separable.
We call this a feature representation, and for a long time people were very concerned with features in the world of supervised learning, because someone had to build those features. You couldn't just take an image and plug it into your logistic regression or your SVM; someone had to take that image and process it with some handwritten code first. Then neural networks came along and were able to learn the features, and suddenly we could apply learning directly to the raw inputs. That was great for images, but it was even greater for all the other fields where people hadn't found good features yet — and one of those fields is actually reinforcement learning. In reinforcement learning, if you don't use neural networks and have to design your own features, the notion of a feature is very opaque. It's very hard to imagine: say I'm playing chess, or Go — what is a feature with which I can represent the value function, or even the optimal policy, linearly?
I wouldn't even know how to begin to think about it. People tried all kinds of things — they would write down what an expert chess player supposedly looks at, like whether the knight is in the middle of the board or not, so that becomes a feature: "knight in the middle of the board." They wrote these long lists of arbitrarily made-up things, and it didn't really lead anywhere. And that's chess, which is a bit more accessible than the robotics problem. Absolutely.
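A schematic contrast between the two eras described here: hand-written feature extractors feeding a linear model, versus a network that consumes the raw input and whose hidden layer becomes whatever features help. The tiny untrained network and the 8x8 "board" are stand-ins, not a real chess system:

```python
import numpy as np

rng = np.random.default_rng(3)

# Old recipe: a human writes feature extractors, then fits a linear value.
def hand_features(board):
    # e.g. "how much material is near the center?" -- brittle, hand-picked
    return np.array([board[3:5, 3:5].sum(), board.sum()])

# Deep RL recipe: feed the raw board in and let the weights become
# whatever internal features help predict value.
W1 = rng.standard_normal((64, 32)) * 0.1
W2 = rng.standard_normal((32, 1)) * 0.1

def learned_value(board):
    h = np.maximum(0.0, board.reshape(-1) @ W1)  # learned "features"
    return float(h @ W2)

board = rng.integers(0, 2, size=(8, 8)).astype(float)
print(hand_features(board), learned_value(board))
```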
Right — with chess there are at least experts on which features matter, but I still love that the neural network figures them out itself. I mean, you put it eloquently and almost made it seem like a natural step to add neural networks, but the fact that neural networks can discover features in the control problem is very interesting, and hopeful. I'm not sure what to make of it, but it's encouraging that the control problem has features to be learned. I guess my question is: are you surprised by how much the deep side of deep reinforcement learning has been able to handle — the problem spaces it has been able to address, especially in games with AlphaStar and AlphaZero, the sheer representational power there, and in the robotics space? And what is your sense of the limits of this representational power in the context of control? With respect to limits, one thing that makes this question hard to answer fully is that in the settings where I would like to push these methods to the limit, we run into other bottlenecks first. The reason I can't get my robot to learn to wash dishes in the kitchen is not that its neural network isn't big enough.
It's that when you try to learn by trial and error — reinforcement-learning style, directly in the real world, where you have the potential to gather these large, varied, complex data sets — you start running into other problems. One problem you hit very quickly seems at first like a very pragmatic problem, but actually turns out to be a fairly deep scientific one. Take the robot in your kitchen and have it try to learn to wash dishes through trial and error: it will break all your dishes, and then there will be no more dishes left to clean.
Now, you might think this is a purely practical problem, but there's something deeper to it. If you had a person trying to do this, that person would have some degree of common sense: they'd break one plate and be a bit more careful with the next one, and if they broke all of them, they'd go get more, or something like that. There's all this scaffolding that comes very naturally to us in our learning process — if I have to learn something through trial and error, I have the common sense to try it carefully several times; if I'm confused about something, I ask for help; and so on. All of that sits outside the classical formulation of the reinforcement learning problem. There are other things that can also be classified as scaffolding but are very important — like, where do you get your reward function from? If I want to learn to pour a cup of water, how do I know whether I've done it correctly?
Answering that probably requires building an entire computer vision system just to make the determination, and that seems rather inelegant. All sorts of things like this start to come up when we think about what we really need to make reinforcement learning happen at real-world scale, and many of them actually suggest a slight deficiency in the problem formulation — some deeper questions we need to answer. That's really interesting. I talked with David Silver about AlphaZero, and it seems we haven't reached the limits at all in the context where there are no broken dishes — in the game of Go.
Is it really just about scaling compute there? So again, the bottleneck is the amount of money you're willing to invest in computing, and then maybe the scaffolding around how hard the computation is to scale. Maybe — but there seems to be no limit, and it's interesting: now we move to the real world, and there are the broken dishes and the reward function, like you mentioned. So how do we move forward there? There's this sample-efficiency question people ask — you know, not having to break a hundred thousand plates. Is this an algorithms question? Is it a data-selection question? What do you think: how do we not break too many plates?
Yeah, well, one way to think about it is that maybe we need to get better at reusing our data — at building up that iceberg. Maybe it's too much to hope that you can have a machine that, in isolation, in a vacuum, without anything else, can master complex tasks in minutes the way people do. But maybe you don't need that either. Maybe what you really need is a lifelong existence, where you do lots of things, and the previous things you've done prepare you to do new things more efficiently. The study of these kinds of questions typically falls into categories like multi-task learning or meta-learning, but they all fundamentally deal with the same general theme: using experience from doing other things to learn how to do new things efficiently and quickly.
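As a concrete sketch of that theme, here is a minimal first-order meta-learning loop in the spirit of MAML/Reptile: the meta-parameter is trained across many toy regression tasks so that a single gradient step on a new task already does well. The task family and step sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_task():
    slope = rng.uniform(0.5, 2.0)           # each task: y = slope * x
    X = rng.uniform(-1, 1, size=20)
    return X, slope * X

def inner_adapt(theta, X, y, lr=0.1):
    grad = 2 * np.mean((theta * X - y) * X) # gradient of MSE for model y = theta*x
    return theta - lr * grad

theta = 0.0
for _ in range(1000):                        # meta-training over many tasks
    X, y = sample_task()
    adapted = inner_adapt(theta, X, y)
    # Reptile-style outer update: move theta toward the adapted solution,
    # so future single-step adaptation starts from a good place.
    theta += 0.05 * (adapted - theta)

# At test time, one inner step on a brand-new task already helps a lot.
X, y = sample_task()
print(theta, inner_adapt(theta, X, y))       # meta-init, then adapted parameter
```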
So let's look at one case study in particular: Tesla Autopilot, which is rapidly approaching a million vehicles on the road, where some percentage of the time — thirty, forty percent — the car is driven by a multi-task computer vision system they call HydraNet, and the rest of the time it's controlled by humans. From the human side, how can we use that data?
What's your sense — what's the signal? Do you have ideas for this autonomous vehicle space, where people can lose their lives, where it's a safety-critical environment? How do we use that data? I think the kinds of problems that come up when we want systems that are reliable and that understand the limits of their capabilities are actually very similar to the kinds of problems that come up in off-policy reinforcement learning. As I mentioned before, in off-policy reinforcement learning the big problem is that you need to know when you can trust your model's predictions, because if you're trying to evaluate some pattern of behavior for which your model cannot give accurate predictions, you shouldn't use it to modify your policy. That's very similar to the problem we face when we deploy such a system and it has to decide, in the moment, whether to trust itself. So maybe we just need to do a better job of figuring out that part, and that's a deep research question — it's also a question a lot of people are working on, so I'm pretty optimistic that we can make progress on it in the next few years.
What is the role of simulation in reinforcement learning, in deep reinforcement learning? How essential is it? It has been essential to the progress so far, to some interesting developments. Do you think it's a crutch we rely on? Do you think we'll eventually be able to get rid of simulation — or will simulation instead take over, with increasingly realistic simulations that allow us to solve real-world problems by transferring models learned in simulation? Yes — I think simulation is a very pragmatic tool we can use to make a lot of useful things work right now.
But in the long run, we'll need to build machines that can learn from real data, because that's the only way to get them to improve perpetually. If our machines can't learn from real data and have to rely on simulated data, eventually the simulator becomes the bottleneck. This is a general principle: if your machine has some human-engineered component that doesn't improve from data, that component will eventually be the thing holding you back. If you depend entirely on your simulator, the simulator will be the bottleneck; if you depend entirely on a manually designed controller, that will be the bottleneck. So simulation is very useful and very pragmatic, but it's not a substitute for being able to use real experience. And by the way, this is quite relevant to some of the things we've discussed, because some of those scaffolding problems I mentioned — the broken dishes, the unknown reward function — are not problems you would ever
stumble on when working in a purely simulated environment, but they become very apparent when we try to run these systems in the real world. To take a brief tangent in our discussion, let me ask: do you think we're living in a simulation? Oh, I have no idea. Do you think that's even a useful thing to think about — the fundamental physical nature of reality? Or is there another perspective? The reason I find the simulation hypothesis interesting is thinking about how hard it would be to create a virtual-reality-game-type situation compelling enough, or pleasant enough, that we humans wouldn't want to leave. That's actually a practical engineering question. I personally enjoy virtual reality a lot, but it's quite far from that point. Still, it's worth thinking about what it would take for people to want to spend more time in virtual reality than in the real world. That's a nice, clear milestone, because once we reach the point where I'd want to live in virtual reality, we're only a few years away from most of the population living there — and that's how we'd create the simulation.
You don't need to simulate quantum gravity and every aspect of the universe exactly. And this relates to an interesting question for reinforcement learning as well: whether we can make simulations realistic enough to blur the difference between the real world and simulation — whether some of the problems we've been talking about would go away if we could create really rich simulations. In fact, I think your question sheds an interesting light on the previous ones, because in some ways asking the practical version of the simulation hypothesis is asking: can we build simulators good enough to essentially train AI systems that will then function in the world? And it's interesting to think about what this implies. If true, it implies that it's easier to create the universe than it is to create a brain — and put that way, it seems a bit strange.
The aspect of simulation I'm most interested in is the simulation of other humans, which seems to be a complexity that makes the robotics problem much harder — though I don't know if everyone in robotics agrees with that notion. As a quick aside: what are your thoughts on when the human enters the picture in the robotics problem? How does that change the reinforcement learning problem, and the learning problem in general? Yeah, it's a complex question, and my hope for a while has been that if we build robotic learning systems that are multi-task, that use lots of prior data, and that learn from their own experience, then the part where they have to interact with people can be handled the same way as all the other parts. If they have prior experience of interacting with people and can learn from their own experience of interacting with people for a new task, maybe that's enough. And of course, if it's not enough, there are plenty of other things we can do, and there's quite a bit of research in that area. But I think it's worth trying to see whether multi-agent interaction — the ability to understand that other beings in the world have their own goals, intentions, and thoughts — can emerge automatically, just from learning to do things and maximize utility.
That information would emerge from the data. You said something about gravity — that you don't need to explicitly inject anything into the system, that it can be learned from the data, and gravity is an example of something learnable from data, like the physics of the world. What are the limits of what we can learn from data? A clean way to ask it: do you really think we could learn gravity — the laws of gravity — from data alone? I think a common mistake when thinking about prior knowledge and learning is to assume that just because we know something, it's better to tell the machine about it than to have it figure it out on its own. In many cases, the things that really matter, and that affect many of the events the machine will experience, are actually quite easy to learn. If every time you drop something it falls — well, you might not get the full picture; you might get something like Newton's version rather than Einstein's version, but it'll be pretty good, and it will probably be enough for you to act successfully in the world, because you see the phenomenon all the time. So things that are readily apparent from the data, we might not need to specify by hand; it might actually be easier to let the machine figure them out.
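The Newton-not-Einstein point can be made concrete with a toy experiment: hand a least-squares fit nothing but noisy observations of dropped objects and read gravity off the fitted curve. The data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)

t = np.linspace(0.0, 1.0, 50)
true_g = 9.81
# Noisy observed drop heights: h(t) = h0 - 0.5*g*t^2 plus sensor noise.
h = 10.0 - 0.5 * true_g * t**2 + 0.01 * rng.standard_normal(t.size)

# Fit h(t) = a + b*t + c*t^2 by least squares; gravity is -2c.
A = np.stack([np.ones_like(t), t, t**2], axis=1)
coef, *_ = np.linalg.lstsq(A, h, rcond=None)
print("recovered g:", -2 * coef[2])   # close to 9.81 -- Newton, not Einstein
```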
It feels like there might be a space of many local minima, in terms of theories of this world, that we'd discover and get stuck in. Yes, of course — Newtonian mechanics is not necessarily easy to arrive at. And in fact, in some domains of science, human civilization has been full of these local optima. For example, think about how people tried to understand biology and medicine: for a long time, the kinds of rules and principles that serve us very well in our everyday lives actually served us very poorly there.
In understanding medicine and biology, we had very superstitious and strange ideas about how the body worked until the advent of the modern scientific method. So that seems to be a failure mode of this approach — though also, possibly, a failure of human intelligence. A little aside, but the idea of self-play is fascinating: in reinforcement learning, creating a competitive context in which agents play against others at roughly their own skill level, and thereby push each other's skill upward. That kind of self-improvement mechanism is exceptionally powerful in the contexts where it can be applied.
First of all, is it beautiful to you that this mechanism works as well as it does? And can it be generalized to other contexts, like the robotics space — anything applicable to the real world? I think it's a very interesting idea, and I suspect that the bottleneck to generalizing it to the robotics setting will be the same as the bottleneck for everything else: we need to be able to build machines that can get better and better through natural interaction with the world. Once we can do that, they can go out and play — they can play with each other, they can play with people, they can play with the natural environment. But before we get there, we have all these other problems to get out of the way. There are no shortcuts; you have to interact with the natural environment. Because in a self-play setting, you still need mediating mechanisms — the reason self-play works in a board game is that the rules of the board game mediate the interaction between the agents. The kind of intelligent behavior that emerges depends heavily on the nature of that mediating mechanism.
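A minimal self-play loop that shows the mediating mechanism at work: two copies of the same learner play rock-paper-scissors and best-respond to each other's empirical history (fictitious play), and the game's rules alone drive play toward the mixed equilibrium. Purely an illustrative toy:

```python
import numpy as np

PAYOFF = np.array([[0, -1, 1],   # rows: my move R/P/S, cols: opponent's move
                   [1, 0, -1],
                   [-1, 1, 0]])

counts = [np.ones(3), np.ones(3)]          # each player's observed history
for _ in range(10000):
    moves = []
    for me, opp in ((0, 1), (1, 0)):
        belief = counts[opp] / counts[opp].sum()
        moves.append(int((PAYOFF @ belief).argmax()))  # best response to history
    counts[0][moves[1]] += 1                # record what the opponent played
    counts[1][moves[0]] += 1

print(counts[0] / counts[0].sum())  # empirical frequencies -> roughly [1/3, 1/3, 1/3]
```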
On the reward side: coming up with a good reward function seems to be something we associate with general intelligence. Humans seem to develop their own reward functions — coming up with meaning, and so on — and yet in reinforcement learning we often specify the reward by hand. What do you think about how we develop good reward functions? I think that's a very complicated and very deep question, and you're absolutely right that classically, in reinforcement learning, this question has been treated as a non-issue: you treat the reward as something external, coming from some other part of your biology, and you don't worry about it. I actually think that's a mistake — we should worry about it — and we can approach it in different ways. We can approach it, for example, by thinking of reward as a means of communication: how does a person communicate to a robot what its objective is?
We can also think of it as a kind of intrinsic motivation: can we write down some general objective such that, in acquiring the capability to satisfy it, the agent learns useful things along the way? This is sometimes called unsupervised reinforcement learning, which I think is a fascinating area of research, especially right now. We've done a bit of work on that; one of the things I've studied recently is whether we can build a notion of unsupervised reinforcement learning using information-theoretic quantities, such as minimizing a Bayesian measure of surprise.
This is an idea that was pioneered in the computational neuroscience community by people like Karl Friston. We've done some work recently showing that you can actually learn some pretty interesting skills by essentially behaving in a way that allows you to make accurate predictions about the world — doing the things that will make the world predictable to you. It sounds a bit circular, but by doing this you can discover stable niches in the world. You may find that if you're playing Tetris, clearing rows lets you keep playing longer and keeps the board nice and clean — which satisfies a sort of desire for order in the world, and as a result you gain some degree of leverage over your domain.
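A rough sketch of surprise minimization as an intrinsic reward, in the spirit of the line of work mentioned here: the agent maintains a simple density model of the states it visits and prefers actions that keep it in high-probability states. The one-dimensional world, the running Gaussian model, and the greedy action choice are all simplifications:

```python
import numpy as np

rng = np.random.default_rng(6)

mu, var, n = 0.0, 1.0, 1        # running Gaussian density model of visited states

def log_prob(s):
    return -0.5 * ((s - mu) ** 2 / var + np.log(2 * np.pi * var))

def update_model(s):
    global mu, var, n
    n += 1
    delta = s - mu
    mu += delta / n
    var += (delta * (s - mu) - var) / n   # Welford-style running variance
    var = max(var, 1e-3)

# A noisy 1-D world; actions nudge the state. The agent picks the action
# whose resulting state is least surprising under its own model.
s = 0.0
for t in range(500):
    candidates = [s + a for a in (-0.5, 0.0, 0.5)]
    a = max(candidates, key=log_prob)      # act to stay predictable
    s = a + 0.1 * rng.standard_normal()    # the world adds noise
    r = log_prob(s)                        # intrinsic reward: negative surprise
    update_model(s)

print(mu, var)  # the agent settles into a stable, predictable niche
```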
We're also quite actively exploring whether there's a role for the human notion of curiosity itself as the reward — the discovery of new things about the world. One of the things I'm very interested in is whether discovering new things can be an emergent property of some other objective that quantifies capability. Doing new things for the sake of new things — maybe that isn't the right answer on its own, but maybe we can find an objective for which discovering new things is actually the natural consequence. That's something we're working on right now, but I don't have a clear answer for you yet; it's work in progress. On the safety side: as a system creatively explores — as it optimizes for a particular measure of capability — are there ways to understand or anticipate the unexpected, unintended consequences of particular reward functions, to anticipate the kinds of strategies that might develop and try to avoid the highly detrimental ones?
Yes — classically this has been quite difficult in reinforcement learning, because it's hard for the designer to have good intuition about what learning outcome a given objective will produce. There are ways to mitigate it, though. One way is to set an objective that says, in effect, "don't do weird things" — and you can actually quantify that. You can say: don't get into situations that have low probability under the state distribution you've seen before. It turns out that's actually also a very good principle for off-policy reinforcement learning, so we can do some things like that.
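A sketch of that "don't do weird things" constraint: estimate the density of states seen in prior data and subtract a penalty whenever the agent strays into low-probability regions. The kernel density model, the weight, and the data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

prior_states = rng.normal(0.0, 1.0, size=(1000, 2))  # the familiar region

def log_density(s, bandwidth=0.5):
    # Simple kernel density estimate under the previously seen states.
    d2 = ((prior_states - s) ** 2).sum(axis=1)
    return np.log(np.exp(-d2 / (2 * bandwidth**2)).mean() + 1e-12)

def shaped_reward(task_reward, s, weight=0.1):
    # Original objective plus a penalty for unfamiliar (low-density) states.
    return task_reward + weight * log_density(np.asarray(s))

print(shaped_reward(1.0, [0.1, -0.2]))  # familiar state: small penalty
print(shaped_reward(1.0, [8.0, 8.0]))   # weird state: heavily penalized
```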
If we venture a little further, toward reward functions for increasingly higher levels of intelligence — Stuart Russell thinks about this: the alignment of AI systems with us humans. How can we ensure that AGI systems align with us? It's a reward-function question: specifying the behavior of AI systems so that their success aligns with the broader interests of humanity. Do you have thoughts on this — any concern about where reinforcement learning fits in — or are you really focused on the current moment, where we're still pretty far away and trying to solve the robotics problem? I don't have a great answer to this, but I do think it's a problem that's important to work on.
For my part, I'm actually a little more worried about the other side of this equation: perhaps more than unintended consequences of objectives that are specified too well, I'm worried right now about unintended consequences of objectives that are not optimized well enough, which could become a very pressing problem when, for example, we try to use these techniques for safety-critical systems like cars and airplanes. I think at some point we'll face the problem of objectives being optimized too well, but right now I think we're more likely to face the problem of objectives not being optimized well enough.
But unintended consequences can arise even when you're far from the optimum — on the way to it. Oh, I think unintended consequences can absolutely arise; it's just that right now the bottleneck for improving reliability and safety is more that the systems need to work better — to optimize their objectives better. Do you have concerns about existential threats from human-level AI systems? I think there are absolutely existential threats from AI systems, just as there are from any powerful technology, but these problems can take many forms. Some will come down to people using the technology with nefarious intent, some will come down to AI systems that have some fatal flaw, and some, of course, will come down to AI systems that are overly capable in some way. Among this set of potential concerns, I'd actually be much more worried about the first two right now — and mainly the one with nefarious humans, because throughout the history of humanity, it's really the nefarious humans that have been the problem, not the nefarious machines. And I think right now, the best I can
do to make sure everything goes well is to build the best technology I can, and hopefully also promote responsible use of that technology. Do you think RL systems have something to teach us humans? You said nefarious humans get us into trouble — machine learning systems have, in a sense, revealed to us the ethical flaws in our data. In that same spirit, does reinforcement learning teach us anything about ourselves? Has it taught you anything? What have you learned about yourself from trying to build robots and reinforcement learning systems? I'm not sure what I've learned about myself, but maybe part of the answer to your question will become a bit more apparent once we see more widespread deployment of reinforcement learning for decision support in domains like healthcare, education, social media, and so on. I think we'll see some very interesting things emerge there.
We'll see, for example, what kinds of behaviors these systems exhibit in situations where they interact with humans and have the ability to influence human behavior. We're not there yet, but maybe in the next couple of years we'll see some interesting things emerge in that area. Hopefully outside of research too — an interesting place to look at this is large companies that deal with huge amounts of data, and I hope there will be some transparency. One of the things that's unclear when I look at social media, and the web in general, is why an algorithm did something, or whether an algorithm was even involved. It would be interesting, from a formal research perspective, to simply study the outcomes of those algorithms — if companies open that data, or are transparent enough about the behavior of these AI systems in the real world.
What is your sense — I don't know if you've read "The Bitter Lesson," Rich Sutton's blog post, where he argues that the big lesson of AI research is that simple, general methods that leverage computation seem to work well. Basically: don't try to do any kind of fancy algorithm design, just wait for compute to catch up. Do you share that intuition? I think the high-level idea makes a lot of sense, but I'm not sure my takeaway would be that we don't need to work on algorithms. My takeaway would be that we should work on general algorithms. And in fact, I think this idea of needing to better automate the acquisition of real-world experience follows quite naturally from Rich Sutton's conclusion: if the claim is that automated general methods plus data lead to good results, then it makes sense that we should build general methods, and we should build the kinds of methods we can deploy and have go out and collect their experience autonomously.
One place where I think the current state of affairs falls a little short of that is actually going out and collecting data autonomously, which is easy to do in a simulated board game but very difficult to do in the real world. Yes — it keeps coming back to this problem. So your mind is focused there now, on this real-world step of collecting the data. It seems scary, and it's not clear to me how we can do it effectively. Well, you know, there are seven billion people in the world, and every one of them had to do it at some point in their lives — we should take advantage of that experience.
We should be able to collect that kind of data. Great — a few last questions, maybe looking back on your life: what technical, fiction, or philosophical books had a big impact on the way you see the world, the way you think about the world, your life in general? And what books, possibly different ones, would you recommend people read on their own intellectual journey — within reinforcement learning, or much bigger? I don't know if this is a particularly scientifically significant answer, but the honest answer is that I found a lot of Isaac Asimov's work very inspiring.
When I was younger. I don't know if it has anything to do with AI directly — you don't necessarily think it had a ripple effect on your life? Maybe it did. But yeah, I think it's a vision of a future where, first of all, artificial minds, artificial intelligence systems, robotic systems, have an important place, an important role in society, and where we try to imagine the edge cases of technological advancement and how that might play out in our future history. I think that was influential in some way — I don't really know how, but I would recommend it.
I mean, if nothing else, you'll be well entertained. When did you first fall in love with the idea of artificial intelligence — get captivated by this field? My honest answer is that I only really started thinking about it as something I might want to do in graduate school, fairly late. A big part of that is that until around 2009 or 2010, it just wasn't high on my priority list, because I didn't think it was an area where we'd see very substantial progress in my lifetime. In terms of my career, the moment I really decided I wanted to work on this was when I took a seminar course taught by Professor Andrew Ng. At that point, of course, I had a decent understanding of the technical aspects involved, but one of the things
that really resonated with me was when he said, in the opening lecture, something like: he used to have graduate students come to him and talk about how they wanted to work on AI, and he would chuckle and give them some math problem to deal with instead — but now he actually thinks this is an area where we might see substantial progress in our lifetimes. That got me thinking, because in an abstract sense, sure, you can imagine that — but hearing someone who had worked on these things his entire career suddenly say it had a real effect on me. Yes — that this could be a special moment in the history of the field, that this is where we might see
some interesting developments. So, in the space of advice: for someone interested in getting started in machine learning or reinforcement learning, what advice would you give to, say, an undergraduate, or maybe someone even younger? What are the first steps to take, and then the steps after that, on that journey? I think one important thing is to not be afraid to spend time imagining the kind of outcome you'd like to see. One outcome could be a successful career, a high salary, or state-of-the-art results on some benchmark, but hopefully that's not the main driving force. If someone who is a student considering a career in AI takes some time to sit down and think:
what do I really want to see a machine do? What do I want to see a robot do? What do I want to see a natural language system do? Imagine it almost like a commercial for a future product — something you'd like to see in the world — and then sit down and think about the steps it would take to get there. And hopefully that goal isn't "a better number on ImageNet classification"; hopefully it's a real thing we can't do today that would be really cool, whether it's a robot butler or a really awesome healthcare decision support system — whatever it is
that you'll find inspiring. I think imagining that, and then working backwards from there through the steps needed to get there, makes for much better research. It leads to rethinking assumptions, to working on the bottlenecks other people aren't addressing. Speaking of reward functions — you just gave advice on the reward-function side of things, the kind of change you'd like to see in the world. So let me ask the big, ridiculous question: what do you think is the meaning of life? What is the meaning of your life — what gives you fulfillment, purpose, happiness, meaning? What is the reward function under which you operate?
Yes — I think one thing that gives me, if not meaning, then at least satisfaction, is some degree of confidence that I'm working on a problem that really matters. It feels less important for me to actually solve a problem than to spend my time on things I believe really matter, and I try quite hard to seek that out. I don't know if it's easy to answer this, but: if you are successful, what does that look like? What's the dream? Of course, success builds on success and goes on forever, but what's the dream? A very concrete thing — or maybe as concrete as it's going to get here — is to see machines that actually get better and better the longer they exist in the world. On the surface one might think that's something we already have today, but I don't think we really do.
I believe there is essentially infinite complexity in the universe, and to date none of the machines we've been able to build improve up to the limit of that complexity. They hit a wall somewhere — maybe because they live in a simulator that is a very pale, limited imitation of the real world, or because they depend on a labeled data set — but they never hit the wall of running out of things to see. So I would like to build a machine that can go as far as possible, that runs up against the ceiling of the complexity of the universe itself.
Well, I don't think there's a better way to end it. Sergey, thank you very much — it's been a great honor. I can't wait to see the amazing work you publish, both in research and in education, in reinforcement learning. Thanks for inspiring the world, and thanks for the great research you do.

Thanks for listening to this conversation with Sergey Levine, and thanks to our sponsors, Cash App and ExpressVPN. Please consider supporting this podcast by downloading Cash App and using the code LEXPODCAST, and by signing up at expressvpn.com/lexpod. Click the links, buy the stuff — it's the best way to support this podcast and the journey I'm on. If you enjoy this, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at Lex Fridman — spelled, somehow, if you can figure out how, without the letter E, just F-R-I-D-M-A-N. And now, let me leave you with some words from Salvador Dali: intelligence without ambition is a bird without wings. Thanks for listening, and I hope to see you next time.