
How Meta’s Chief AI Scientist Believes We’ll Get To Autonomous AI Models

May 05, 2024
Thank you, Yann, and welcome. My goodness, thank you for this, the highlight of my year: the opportunity to talk to you. I don't know what you can see right now, but there are 2,000 of the smartest people on the planet looking at you from Cambridge, and boy, what an opportunity to pick your brain. It's in stereo, look at that; well, I can see them from the back. Actually, if you want Yann to see your face, he is also behind you. So, Yann, what an amazing coincidence: Llama 3 dropped just as we were meeting today. What are the odds? Incredible, absolutely amazing. So what came out today was Llama 3 8B, 8 billion parameters, and 70B. So far, what we're hearing in the rumor mill is that the 8B works about as well as the old Llama 2 70B did, so we are seeing an order of magnitude change.
Do you think that's right? I also noticed that it was trained on 15 trillion tokens; where did that come from? 15 trillion tokens, okay. So the first thing I have to say is that I don't deserve any credit for Llama 3. Maybe a little credit for making sure our models are open source, but the technical contributions, you know, come from a very large collection of people, and I have a very, very small share. So, 15 trillion tokens. Yeah, I mean, you get all the data you can: all the high-quality public data, and then, you know, fine-tuning data and, you know, licensed data and everything. That's how you get to 15 trillion, but that's a little saturated; there's only a limited amount of text that you can get, and that's it. Well, I have to say, I owe you a huge fraction of my life's journey. You didn't know, but when you were doing optical character recognition back in the day, I was reading your CNN papers; he invented the convolutional neural networks that actually did those things.

The job that became my first dollar of income at a startup was making neural networks based on your work. It changed the course of my life. Now you are doing it again, especially for the young people here in the front: by being the champion of open source, I think you're fundamentally giving them the opportunity to build companies that otherwise couldn't be built, so we owe you a huge debt of gratitude for standing up for that in the first place. And the next thing that happens could be one of those events that we remember in history and say was a turning point for humanity.
The monster neural network, the 400 or 500 billion parameter one coming out soon, will it also be open source? From what I've gathered it's about 400 billion. Yeah. Yeah, dense, not sparse, which is interesting. So, yeah, it's still training. You know, despite all the compute we have in our hands, it still takes a long time; it takes a long time to fine-tune. Yes, but there will be a lot of those coming out; you know, variations of those models will come out in the next few months. Yeah, I was going to ask that question next: so they won't all come out at the same time. Interesting, which means you still have to be in the training process. It's a huge effort, and I saw in the news that Facebook had bought another 500,000 Nvidia chips, bringing the total to around a million by my calculations, unless you got a discount, which you could have.
Maybe you got a volume discount, but that's $30 billion worth of chips, which would make this model training larger than the Apollo lunar mission in terms of research and development. Am I understanding that correctly? It's amazing, isn't it? I mean, a lot, you know; not only training but also deployment is limited by compute capacity. I think one of the problems that we're facing, of course, is the supply of GPUs, which is one of them, and the cost of them right now, but another one is actually scaling up the learning algorithms so that they can be parallelized across many GPUs, and progress on this has been a bit slow in the community.
So I think we're expecting advances there, but we're also expecting other advances, you know, in terms of architectures: new principles, completely new blueprints with which to build AI systems that allow them to do things they can't do today. And since you mentioned the philosophy of taking an investment of that size and then opening it up, there's no historical precedent for this. You know, the equivalent would be if you built a gigafactory that builds Teslas and you somehow gave it to society; but the thing is that once you open-source something it can be copied infinitely, so it's not even a good analogy to talk about an open source gigafactory. So there's no precedent for this in the history of business. What is the logic behind making it open source?
What do you want to happen with this? So it's in the DNA of Meta, you know, Facebook before, from the beginning. There are many open source packages, basically infrastructure software, that Meta has open-sourced over the years, including in AI. So everyone uses PyTorch; well, everyone except a few people at Google, but almost everyone, and that's open source. It was originally built at Meta, and Meta actually transferred ownership of PyTorch to the Linux Foundation so it could be much more of a community effort. So that's really part of how the business operates, and the reason is, you know, infrastructure is better and improves faster when it's open source, when more people contribute to it, when there are more eyes looking at it. It's also more secure, yeah.
What's true for, you know, internet infrastructure software is also true for AI, and then there's an additional thing for AI, which is that foundation models are very expensive to train. It would be a complete waste of resources, you know, to have 50 different entities each training their own foundation model. I mean, it's much better if there are only a few, but they are made open, and that basically creates the substrate for a whole ecosystem to take off. It's pretty much the same thing that happened to the internet in the '90s. If you remember, in the mid '90s, when the internet started to become very popular, the software infrastructure was dominated by proprietary platforms from Microsoft or Sun Microsystems, and both lost; they disappeared from that market. Now it's all Linux, Apache, you know, MySQL, PHP, whatever, all the open source stuff. Even the core of web browsers is open source; even the software stacks of cell phones and cellular towers are open source today. So the infrastructure needs to be open source; it just makes it progress faster and be more secure, and all of that is good.
I'm really glad to hear you say that, because there are definitely divergent philosophies on this if you think about where OpenAI is going versus where you are going. The version of the world you're describing is one where all these startups and all these teams can thrive and be competitive and create and innovate, and the alternative version is one where strong AI is invented in a box and controlled by a very small group of people, and all the benefits, you know, accrue to a very small group. I have nothing to do with this, but I certainly love your version of the future more than the alternative versions, so I'm really glad to hear you say it. So I want to spend a lot of our time, the limited time we have, talking about the implications of this and where you think it's going.
I also want to ask you about V-JEPA. So you've been very clear in saying that LLMs will take us down a path of amazing things we can build, but they won't take you to a truly intelligent system; you need experience in the world, and I think V-JEPA is your solution, what's going to get us to that goal. Tell us about V-JEPA. Well, first of all, I have to tell you where I think AI research is going. I wrote a pretty long vision document on this about two years ago that I put online, which you can search for; it's on OpenReview, and it's called "A Path Towards Autonomous Machine Intelligence."
Now I replace "autonomous" with "advanced" because people are afraid of the word "autonomous." So we have this autonomous, or advanced, machine intelligence, which is abbreviated AMI, and in French you pronounce it "ami," which means "friend" in French, which I think is a good analogy. Anyway, the current LLMs are very limited in their abilities, and, you know, Stephen W., just before, also pointed out those limitations. One of them is that they don't understand the world; they don't understand the physical world. The second is that they have no persistent memory. The third is that they can't really reason in the sense that we normally understand reasoning; they can regurgitate previous reasoning that they have been trained on and adapt it to the situation, but they don't actually reason in the sense that we understand it for humans and many animals. And the last thing, which is also important, is that they can't really plan; again, they can regurgitate plans that they have been trained on, but they can't really plan in new situations. There are many studies, from various people, showing the limitations of LLMs for planning, reasoning, and understanding the world, and so on. So basically we need to design new architectures that would be very different from the ones we currently have, architectures that will make AI systems understand the world, reason, plan, have memory, and also be controllable, in the sense that you can give them objectives and the only thing they can do is fulfill those objectives and do nothing else, subject to some guardrails. That's what would make them safe and controllable too. So the missing part is how we get an AI system to understand the world by observing it, a bit like baby animals and humans do.
You know, it takes a long time for baby humans to really understand how the world works, like the fact that an unsupported object falls because of gravity; human babies take about nine months to learn this. It's not something you're born with; it's something for which you have to observe the world and understand it. So how do we reproduce this ability with machines? For almost 10 years now, my colleagues and I have been trying to train systems to basically predict what happens in videos, with the idea that if you get a system to predict what is going to happen in a video, it has to develop some understanding of the nature of the physical world. And it's been basically a complete failure; we tried a lot of things for many years. But a few years ago, what I realized is that the architectures we can use to train deep learning systems to learn good image representations are not generative; they are not things where you take an image, corrupt it, and then train a system to reconstruct the uncorrupted image. That is, of course, the way we train LLMs: we take a piece of text, remove some of the words, and train a gigantic neural network to predict the missing words. If you do this with images or videos, it doesn't work, or it kind of works, but you get representations of images and videos that are not very good, and the reason is that it is very difficult to reconstruct all the details of an image or a video that are hidden from you.
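To make the masked-prediction objective described here for text concrete, the following is a minimal toy sketch in PyTorch; the tiny vocabulary, the single masked position, and the model are illustrative assumptions, not anything from Llama or Meta's actual training code.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 1000, 64
mask_id = 0  # special token standing in for a removed word

# Toy stand-in for the "gigantic neural network": embedding plus a linear prediction head.
embed = nn.Embedding(vocab_size, emb_dim)
head = nn.Linear(emb_dim, vocab_size)

tokens = torch.randint(1, vocab_size, (1, 16))   # a fragment of text as token ids
corrupted = tokens.clone()
corrupted[0, 5] = mask_id                        # remove one word

logits = head(embed(corrupted))                  # predict a word at every position
# Train only on the masked slot: the network must guess the missing word.
loss = nn.functional.cross_entropy(logits[0, 5:6], tokens[0, 5:6])
loss.backward()
```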
So what we discovered a few years ago is that the way to address that problem is through what we call a joint embedding architecture, or a joint embedding predictive architecture, which is what JEPA stands for; it's an acronym. The idea of joint embedding architectures dates back to the early '90s; some people worked on it, and we used to call them Siamese nets. The idea is basically this: if you have, let's say, a video fragment and you mask out some part of it, say the second half of the video, you could train a large network to try to predict what will happen next in the video; that would be a generative model. Instead, we run both pieces of video through encoders, and then we train a predictor in representation space to predict the representation of the full video, not all the pixels of the video, and we train everything simultaneously. We didn't know how to do this four or five years ago, and we have since figured out several ways to do it; we now have half a dozen algorithms for this. So V-JEPA is a particular example of this kind of thing, and the results are very promising.
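A minimal sketch of the joint embedding predictive idea as described above: encode both the visible and the masked pieces of a clip, and train a predictor to match the target representation rather than reconstruct pixels. The module sizes and names are assumptions for illustration, not the actual V-JEPA implementation, which also needs extra machinery to avoid representational collapse.

```python
import torch
import torch.nn as nn

# Illustrative joint-embedding predictive setup (not the real V-JEPA code).
class TinyEncoder(nn.Module):
    def __init__(self, in_dim=1024, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        return self.net(x)

context_encoder = TinyEncoder()   # sees the visible (unmasked) part of the video
target_encoder = TinyEncoder()    # sees the masked part; often a momentum copy in practice
predictor = nn.Linear(256, 256)   # predicts the target representation from the context one

visible_clip = torch.randn(8, 1024)   # stand-in features for the first half of a clip
masked_clip = torch.randn(8, 1024)    # stand-in features for the second half

# A generative model would reconstruct masked_clip's pixels.
# JEPA instead predicts its representation:
pred = predictor(context_encoder(visible_clip))
with torch.no_grad():                 # the target branch is typically not trained by this loss
    target = target_encoder(masked_clip)

loss = nn.functional.mse_loss(pred, target)
loss.backward()
```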
I think ultimately we'll be able to build or train systems that basically have mental models of the world: systems that have some notion of intuitive physics, some ability to predict what is going to happen in the world as a result of taking an action, for example. And if you have such a model of the world, then you can plan; you can plan a sequence of actions to get to a particular goal, and that's what intelligence is really about.
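One hedged way to picture planning with a learned world model is a simple random-shooting search in representation space: roll candidate action sequences through the model and keep the one whose predicted outcome is closest to the goal. The function names, the encoder, and the search strategy below are all illustrative assumptions, not a description of Meta's systems.

```python
import torch

def plan_with_world_model(world_model, encode, current_obs, goal_repr,
                          horizon=5, num_candidates=64, action_dim=4):
    """Pick the action sequence whose predicted final representation is closest to the goal."""
    state = encode(current_obs)                                     # abstract representation of the present
    candidates = torch.randn(num_candidates, horizon, action_dim)   # random-shooting proposals

    best_cost, best_seq = float("inf"), None
    for seq in candidates:
        s = state
        for action in seq:              # roll the world model forward in representation space
            s = world_model(s, action)
        cost = torch.norm(s - goal_repr).item()   # distance of predicted outcome from the goal
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq
```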
Here's a really critical question, actually: when you use diffusion algorithms to create images, you know they will often draw six or four fingers; they never get five fingers every time. These LLMs have a surprising amount of common sense, but they also lack a surprising amount of common sense. Once you put in the JEPA and V-JEPA data, you give the system a lot more of a chance to think a lot more like us, because all the real-world experience of moving and feeling things is built into the training data. So do you think the result will be one massive foundation model, or will we continue to use the approach of combining experts and putting them together in sort of synthetic forms?
I think ultimately it will probably be one big model. Of course, it will be modular in the sense that, you know, there will be multiple modules that interact but are not necessarily completely connected to each other. There's a big debate now in AI: if you want a multimodal system that deals with text and images and videos, should you do early fusion, that is, basically tokenize images or videos, convert them into sort of little vectors that you can concatenate with the text tokens, or should you do late fusion, which means running your images or videos through some kind of encoder that is more or less specialized for them, and then having some fusion on top? I'm more in favor of the second approach, but a lot of the current approaches are actually early fusion because it's easier, it's simpler.
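A rough sketch of the two fusion strategies being contrasted, using small PyTorch modules; the encoders, dimensions, and pooling choices are placeholder assumptions rather than any particular model's design.

```python
import torch
import torch.nn as nn

emb_dim = 256
text_tokens = torch.randn(1, 32, emb_dim)    # stand-in for embedded text tokens
image_patches = torch.randn(1, 49, emb_dim)  # stand-in for patchified image features

# Early fusion: turn images into token-like vectors, concatenate with the text tokens,
# and let a single shared backbone attend over the combined sequence.
early_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True), num_layers=2)
early_out = early_backbone(torch.cat([text_tokens, image_patches], dim=1))

# Late fusion: run each modality through its own (possibly specialized) encoder first,
# then combine the pooled summaries on top.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True), num_layers=2)
image_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True), num_layers=2)
fusion_head = nn.Linear(2 * emb_dim, emb_dim)

text_summary = text_encoder(text_tokens).mean(dim=1)
image_summary = image_encoder(image_patches).mean(dim=1)
late_out = fusion_head(torch.cat([text_summary, image_summary], dim=-1))
```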
I'm going to do the dangerous thing of asking you to predict the future, but if you can't, then no one can, so it has to be you. So once you put in the V-JEPA data and train these massive models, and assume you scale up another 10 times, you know, buy another $30 billion or so of chips, will the combination of the V-JEPA data plus this massive scale be enough to then solve fundamental problems, like physics problems and biological experimentation problems, or are we still missing something along the way that needs to be thought about and added?
It's clear that we are missing a number of things; the problem is that we don't know exactly what. We can see the first obstacle, really, but where it goes after that is not clear. The hope is that we get systems that have some level of common sense. At first, you know, they won't be as smart as the best mathematicians or physicists, but they might be as smart as your cat, and that would be a good advance, a pretty good advance. If we had systems that could understand the world like a cat; if we had systems that could be trained very easily, in 10 minutes, like any 10-year-old, to clear the table and fill the dishwasher, we would have home robots; if we had systems that could learn to drive a car in 20 hours of practice, like any 17-year-old, that would be a big, big advance. Hello, I'm just here for a second, let me take some time, just so you know. We talked at that party in Davos about this, and we enjoyed having you at Imagination in Action in the dome. This is the second of three of our events, and I don't know if you realize this, but if you speak at all three, and the next one is June 6th, you will get a Chia Pet. This is a photo of a Chia Pet, so yeah, I think a Chia mascot would go great there.
Did you like speaking under the dome? Not the MIT dome, but the MIT event in Davos. Yes, it was fun. Okay, can I lock you in for next year? There was a spectrum of, you know, techno-positive, optimistic type people, and I was at one end of that spectrum, and on the other side were the doomers. Who do you think the doomers at Davos are? Well, we have someone from OpenAI, and since you work at Meta, you may not want to be seen on the same Zoom. So, ladies and gentlemen, Yann LeCun. Thank you, Yann. Thank you, well done.
