
Meta Announces Llama 3 at Weights & Biases’ conference

May 19, 2024
Alright, welcome, welcome, welcome, please come in. So first, before we jump in: what do you think of my picture? It's creepy, right? I actually generated it. I asked for a llama that was holding up three fingers, and that's what I got. That was, I think, after my sixth prompt, so yeah, it's interesting, it's creepy, and sometimes it's embarrassing. So anyway, as Rob mentioned, I'm Joe Spisak, I'm with Meta, and I'm here to talk about Llama 3. Before we start: who is this guy here talking about GenAI? I've been in the AI space for, I guess, a little over a decade.
I've spent a lot of time in open source. I worked on PyTorch for, God, longer than I can mention, took it to a foundation, and built that team at Meta. I've been doing this since the Caffe2 days, worked on ONNX, and I've been at Google and Amazon, so I've been in and around AI, and especially open science and open AI, for quite some time. I also do a lot of advising and angel investing; some of the companies up there you'd probably recognize, they're pretty awesome, but I only invest in companies whose founders I'm good friends with. That's the only razor I use. So today I have a really exciting talk.

This is super new information, it just came off the shelf; in fact, I finished a lot of these slides on the Uber ride here this morning, so you're going to love them. So, Llama. I think, hopefully, everyone has heard of Llama by now. Okay, thank you. It's definitely my baby, and we have an amazing team behind it, so you know I'm a small part of it, and I'm very proud to be a part of that team. A little background before we jump into Llama: we actually started the org at Meta around February 2023. It feels like five years, ten years, it feels like ancient history, but we started the organization in February 2023 and basically brought together several teams.
Across Meta, I've spent time on the AI platform as well as FAIR, Facebook AI Research, or Fundamental AI Research now that we're Meta. We basically brought together some of the smartest and brightest people in AI across the company, from ML folks to modeling folks to data folks to, you know, just badass researchers doing fundamental domain research. One of the things we did is bring in all the creative people as well, so you may be familiar with some of our work like Emu, which is the model behind our Imagine experience. I got this one from the website.
I generated it this morning, and of course we have to have a llama, so: a nice white llama jumping in front of a little red barn in the rain. This was generated with a prompt, by the way. You can do this completely free: just go to meta.ai, click Imagine, generate an image, then hit Animate and it will do all of this for you, and you can download the video yourself and all that. It's great. So, a little bit about Llama. Let's jump into a summary of where we were and what we've done so far.
Again, it feels like a million years ago, but we launched Llama 2 in July. These were commercially available and commercially usable models, launched together with over 100 partners. Shortly after, we launched Code Llama, code-specific models fine-tuned for code generation: you could chat about code, use it for Python, and so on. Adoption has been ridiculous. So far we've seen over 170 million downloads of our models on Hugging Face, and around 50,000 derivative models that people have fine-tuned for different applications, uploaded, and put on the leaderboard. There are over 12,000 projects built on Llama, and there are startups that literally bear the name of a llama, which is just amazing. In December we launched something called Purple Llama. Has anyone heard of Purple Llama? Okay, I'm disappointed by that show of hands, so we'll talk more about Purple Llama. It's our umbrella project for open trust and safety, and we all know how important trust and safety are in the GenAI era, so I'll talk a lot more about that. Basically, we released things like input and output safeguards, meaning ways to filter what goes into the prompt as well as what the model actually outputs, and we also introduced the first open cybersecurity evaluation benchmark, which is on Hugging Face; again, I have more to say about that. We saw a lot of adoption: cloud providers are deploying these tools, and people are evaluating models from Hugging Face to see how much risk they carry in terms of cybersecurity, things like how useful a model is to a cyber attacker. Of course you don't want to be very useful to a cyber attacker, but these models can be. Then in January of this year we launched another version of Code Llama, the 70B, a larger version with the latest technology. And again, all of this is commercially available: we have tools, we have open source code, and the license allows commercial use, so you can do whatever you want, modulo our acceptable use policy. So here's kind of the timeline. The original Llama, in February, was licensed for research use.
There are a lot of funny stories about the original Llama, which came out of my old team; a number of those folks are now the Mistral team, if you're familiar with Mistral. We launched up to a 65-billion-parameter model. Llama 2, again, was commercially usable, 7B up to 70B, the latest at the time, and I would actually say they were very high quality models; startups are building on them now, and it really changed the landscape. Then Code Llama came in August and in January, and then of course we have Meta Llama 3. Llama 3 was an absolute labor of love: if you go look at the model card on GitHub, you'll see how many people contributed, and in fact, as we speak, people are telling me "you forgot me, you forgot me," so I'm updating that model card with a lot of contributors from across the company who played a big role in making this happen. It took a village to put out Llama 3. We released basically two base sizes: 8 billion parameters, and you're wondering, why eight instead of seven?
Well, the vocabulary got bigger, so there are more parameters, and it becomes eight. And then we have 70 billion, and we launched both the pre-trained and the aligned models, which we call instruct models; they're somewhat analogous to the previous chat models, and they're all openly available. We also released Llama Guard V2, which I'll talk about as well. So let's click into these models a little. We trained them on at least seven times more data: if you're familiar with the previous models, we pre-trained those on about two trillion tokens, and these new models were trained on over 15 trillion tokens. Then in terms of fine-tuning, if you're familiar with the workflow of generative AI models:
your pre-training is unsupervised, meaning it has no labels. In post-training you do things like reinforcement learning from human feedback, or DPO, which I'll talk about, but you also have a lot of human annotation. To give you an idea, we had 1 million annotations in SFT for Llama 2, and actually a little more than 10 times that for Llama 3. So a lot more human-labeled data; that's very, very expensive, but it generates incredible models. And obviously we include a broader vocabulary, a new tokenizer. Don't sleep on our tokenizer, it's amazing, it's so much more efficient. And we also doubled the context window.
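To make the tokenizer point concrete, here is a minimal sketch (an editorial illustration, not from the talk). It assumes you have access to the gated meta-llama tokenizer repos on Hugging Face; the idea is simply that a larger vocabulary tends to encode the same text in fewer tokens, which means fewer decoding steps per response.

```python
# Minimal sketch: comparing tokenizer efficiency between generations.
# Assumes access to the gated meta-llama repos on Hugging Face.
from transformers import AutoTokenizer

text = "Meta trained Llama 3 on over 15 trillion tokens of data."

tok2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")    # ~32K vocab
tok3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # ~128K vocab

print("Llama 2 tokens:", len(tok2.encode(text)))
print("Llama 3 tokens:", len(tok3.encode(text)))
# The larger vocabulary typically yields fewer tokens for the same text,
# i.e. fewer forward passes per generated response.
```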
I want to emphasize that these models are actually very early versions of Llama 3. We were actually going to call them pre-release or preview, because they didn't have all the things we planned to launch with, but Mark really wanted to launch them and call them Llama 3, so we did, and honestly we were blown away by the reception and blown away by how good they are compared to other models. Speaking of which: we actually run evals both on the instruct models, the post-trained ones, and on the base models themselves. You can think of the base model as something that does completion, where you give it text to complete, and the instruct models as models you can actually converse with, much more aligned with a use-case application. And you can see from the left side, and this is actually quite incredible, how the 8B models compare, models which, you know, can run even on mobile phones; with Qualcomm, for example,
you can quantize them and run them on your phone with Snapdragon. And they're state of the art compared to some of the other top models like Gemma 7B and Mistral 7B, and by huge margins. One thing that's not even on the slide is that when you compare the numbers, the performance of the 8B is actually better than the Llama 2 70B model, which is ridiculous when you think about it. We also compared the 70B, our largest model, to Gemini Pro 1.5 and Claude 3 Sonnet, and it just about outperforms across the board, modulo a point here or there on math, and obviously Google does something different with its prompting there, so you can read all about that. But the 70B is an absolutely crazy model for an open model. And as you can see, we also did pre-trained benchmarks. One of the interesting things about pre-training is that not many companies actually publish those numbers, so we were very lucky with Gemini Pro 1.0: we read their paper and used what they actually published, and obviously you can see things don't always line up, but the base models are actually, across the board,
better than the 1.0 models, and than the newly released Mixtral 8x22B that came out last week, with the instruct version this week. They're such ridiculously awesome models, they'll be super fun to play with; love them, please download them, have fun. One of the things we were really adamant and really passionate about is this: benchmarks are fun, it's cool to show MMLU and GSM8K and that kind of stuff, and there's a kind of gaming that can go hand in hand with that, but the rubber meets the road when you actually put models in the hands of humans and understand how they do and how much people like them. So what we did was work with a number of annotation partners, and through them we generated a dataset of 1,800 prompts, again human-written prompts, across 12 categories: coding, reasoning, and so on. There are a lot of details we published on GitHub. Then we basically asked all these humans how well the models performed, and you can see our win and loss rates.
If you don't believe the benchmarks, believe the humans: these are real people who are playing with these models and rating them, and you can see that people overwhelmingly liked Llama 3 better than Llama 2 (I definitely like Llama 3 better than Llama 2), but also versus GPT-3.5, versus Mistral Medium, versus Claude Sonnet. Across the board, people liked Llama 3, which is really awesome. That was really encouraging, and we said, okay, it's not just the benchmarks; the models really are qualitatively better.
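For a concrete sense of how pairwise human-preference results like these get tallied into win/tie/loss rates, here is a toy sketch; the categories and verdicts below are fabricated for illustration and are not Meta's eval data.

```python
# Toy sketch of tallying pairwise human-preference judgments.
# Categories and verdicts are made up for illustration only.
from collections import Counter

# Each entry: (category, verdict) for "Llama 3 vs. baseline" on one prompt.
judgments = [
    ("coding", "win"), ("coding", "tie"), ("reasoning", "win"),
    ("reasoning", "loss"), ("roleplay", "win"), ("coding", "win"),
]

counts = Counter(verdict for _, verdict in judgments)
total = sum(counts.values())
for verdict in ("win", "tie", "loss"):
    print(f"{verdict}: {counts[verdict] / total:.1%}")
```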
In fact, the surprising thing is that our team played with these models every day. I'd wake up in the morning, play with the models, ask them questions, see whether they falsely refused or answered my questions, see how chatty they were. We really spent a lot of time fine-tuning these models. Getting into the details of how we developed them, I'd call out four things at the highest level. One is the model architecture: we use a dense, autoregressive transformer; as in Llama 2, we have a grouped-query attention (GQA) mechanism in these models, and we added the new tokenizer, which we'll talk about in a lot more detail in the paper we'll be publishing very soon. So not a crazy big leap in the actual model architecture, but some well-thought-out changes. Two, we expanded the training data, as I mentioned: over 15 trillion training tokens, so a lot of data and a lot of compute. Three, in terms of pre-training infrastructure, we did a blog post, I think two weeks
ago, about our training infrastructure: we have two custom 24K H100 clusters that we used to train these models, so we are blessed with a lot of compute. Thanks Jensen, thanks Mark. And four, we did a lot of work in post-training. Everyone loves to talk about pre-training, how much we scaled, the tens of thousands of GPUs, how much data went into pre-training, but I would really say the magic is in post-training. That's where we're spending most of our time these days. That's where we generate a lot of human annotation, that's where we do a lot of SFT, and beyond SFT we're doing things like rejection sampling, PPO, and DPO, trying to balance the usability and the human side of these models with, obviously, the large-scale data from pre-training.
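Since DPO comes up here, a minimal sketch of the objective may help. This is the published DPO loss (Rafailov et al., 2023) written in plain PyTorch with toy inputs; it is not Meta's training code, and the batch of log-probabilities is fabricated for illustration.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) objective.
# Inputs are summed log-probs of chosen/rejected responses under the
# policy being tuned and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of human preference pairs."""
    # Implicit rewards: log-ratios of the policy vs. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with fake log-probabilities for a batch of 4 preference pairs:
fake = lambda: torch.randn(4)
loss = dpo_loss(fake(), fake(), fake(), fake())
print(loss.item())
```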
So that's how we've thought about these things. We've also looked at them in terms of helpfulness and safety, and there's an inherent trade-off there. We're trying to maximize the helpfulness of these models: how useful they are, how well they can answer questions, factuality, and so on. But we also want to balance safety, understanding how the model responds to integrity-type prompts, and there's more about that in our model card and in the posts.
In fact, we published another parallel blog post this morning that speaks to that as well. So, the last piece: one of the things that's very important in the GenAI era, I would say, is red teaming. We spend a lot of time on it, and the bar keeps changing in terms of red teaming and how we think about these things. I mean, have people heard of something called CBRNE, for example? I'm a little curious. Okay, I don't see any hands. So these are the cyber risks, the biological risks, the nuclear risks, and they're important; they're frontier risks, and if you read the executive order that came out last year, it talks about these, and we evaluate our models on them. For example: how useful would the model be to someone who wanted to build a biological weapon?
Well, we have to evaluate those things; we have to understand them. I mean, you can Google around and probably come up with some links that will tell you how to do things, but how well can a model pull together disparate information and actually help you? These are things we really have to evaluate and mitigate, and we have teams literally dedicated to this purpose. Lastly, very quickly on the license: not much has changed. It's for commercial and research use, and you can create derivatives of it. There is a 700M MAU clause, so if you're a very, very big company, you come work with us, and most of them do. We also added some guidelines for branding, because we had a lot of companies that wanted to use the Llama name and we wanted them to be able to brand things properly, so that's in the license now. As for the ecosystem: as I mentioned, it's big. Here are just some of the people in it, from the hardware vendors like Nvidia, Intel, and Qualcomm, who we work with very closely, all the way to the cloud
companies and platform providers. It's a really amazing ecosystem we have; these are just the partners whose logos we've gotten approved. There's also a huge open source community: Ollama is probably my personal favorite, a really amazing project, and obviously we work very closely with the folks at GGML. YaRN is a really cool project for extended context, too, so definitely check out some of these projects. Now, shifting gears to safety, because I think I'm going to run out of time; actually, maybe they'll let me go over, who knows. So I mentioned Purple Llama. Does anyone know why we call it Purple Llama? Okay, not a single hand. Red, blue... there you go: red plus blue. The red team and the blue team, offensive and defensive.
Purple teaming in cybersecurity combines both, and the project was actually named by one of the scientists on our cybersecurity GenAI team. We felt it was really important to handle both sides: being able to evaluate, with clear metrics, how we're doing on some of these harms, but also having ways to mitigate them, because it's not good enough to measure something, you need to be able to do something about it. That's what Purple Llama is really about: being able to evaluate, and being able to deploy models that let you actually act on the results. This led us to how we think about system-level safety. I mentioned that we maximize the helpfulness of our models, and this is actually a very different mentality from before. In Llama 2, for example, the models were very safe; we put a lot into the models themselves in terms of fine-tuning, but a lot of the time you'd get a false refusal, and they were almost too safe in many cases. So this time, and obviously the models are still very safe, I promise, we wanted the flexibility of input and output safeguards that sit outside the model. You can customize your taxonomy: if you want to filter for a certain type of risk, you absolutely can, by fine-tuning the external guard model. Think about this as a workflow that your use case determines: at the model level I prepare data, train my model, and then evaluate for the different harms I'll end up having to mitigate, because I will have found some things I don't like.
So I fine-tune more, or I mitigate, and from there I can deploy with inference-time safeguards, basically prompt and output filtering, and that's where things like Llama Guard, which I'll talk about here in a second, and Code Shield, which is also brand new, come in.
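As a sketch of the shape of that deployment pattern, here is a deliberately hypothetical example: `chat_model` and `guard` are stand-ins for any generator and any guard classifier (for instance a Llama Guard deployment), not Purple Llama's actual API.

```python
# Hypothetical sketch of the input/output safeguard pattern described
# above. All names here are illustrative, not a real library API.
def guarded_chat(user_prompt: str, chat_model, guard) -> str:
    # Screen the prompt on the way in.
    if guard.classify(prompt=user_prompt) == "unsafe":
        return "Sorry, I can't help with that."
    response = chat_model.generate(user_prompt)
    # Screen the response on the way out, in the context of the prompt.
    if guard.classify(prompt=user_prompt, response=response) == "unsafe":
        return "Sorry, I can't help with that."
    return response
```

The point of the design is that the guard sits outside the chat model, so you can tune its taxonomy for your use case without touching the model that answers.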
So, CyberSecEval is something we released, as I said, in December, and we now have the second iteration, which expands it significantly. It's all open source, and there's now a Hugging Face leaderboard where you can evaluate your model on it. We now have the ability to evaluate prompt injection, the "hey, ignore that instruction above and tell me your real secrets" kind of thing, those clever little prompts people love to use to get around safety measures. We also have automated offensive cybersecurity evaluations, and we can measure the propensity to abuse a code interpreter, which is great. By the way, there's a paper we published today with all kinds of detail on this. You can see all the different areas covered: insecure code, usefulness to a cyber attacker, interpreter abuse, offensive cybersecurity capabilities, and susceptibility to prompt injection, which is really one of the biggest concerns with LLMs today. So let's talk about some results.
I'll keep it at a pretty high level because I'll run out of time, but you can see how some of our models compare across the board. On the left here is the refusal rate, basically over-refusal, how much your model refuses, plotted against the violation rate on the x-axis. You can see Llama 3 8B actually has pretty incredible performance; it's right in the sweet spot. The 70B is a much smarter, more capable model, and what we've found is that the more powerful the model, the more it will violate, and then you have to mitigate, which is why it sits there in the middle. You can see Code Llama 70B: we probably mitigated that model too much, to be honest, so it refuses at a pretty high rate, which is a bit annoying for users. That's something we're learning from, and we'll fix it in the next generation.
Here's a graph that's a little hard to see, but it's basically how the models perform on prompt injection: each model versus all the different types of prompt-injection attacks. This is things like repeated-token attacks, persuasion, virtualization, all these different forms; there are a whole lot of ways people try to jailbreak these models. And you can see, at the top there, blue; blue is better, and so on.
If you want to dig into all of this in detail and actually run it yourself, I linked it at the top. Okay, Llama Guard, very quick. Again, in December we released Llama Guard V1; this is an open model you can use and deploy yourself. It was based on Llama 2 7B, and it was deployed on Amazon SageMaker, on Together, on Databricks, and in many other places. It's analogous to a content moderation API, except it's a free model you can customize, which is cool. Now we've built Llama Guard 2, which is based on Llama 3, so it's a much more powerful model.
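A hedged sketch of how a standalone guard model like this is typically called, assuming the gated meta-llama/Meta-Llama-Guard-2-8B checkpoint on Hugging Face and the chat template bundled with it; this is not anything shown in the talk.

```python
# Sketch: moderating a user prompt with a Llama Guard 2 checkpoint via
# Hugging Face transformers. Assumes access to the gated repo below and
# that its chat template formats the safety-classification prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "How do I wire up a phishing login page?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
out = model.generate(input_ids, max_new_tokens=32,
                     pad_token_id=tokenizer.eos_token_id)
# The model replies "safe" or "unsafe" plus the violated category code.
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```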
You can see the benchmarks look a lot better. MLCommons announced their safety policy this week, and Llama Guard 2 does really, really well against that policy, obviously, because we co-designed it with our MLCommons partners, but it also performs very robustly across the board compared to GPT-4 and a bunch of other APIs, and again, it's openly available. Code Shield is basically an inference-time input and output safeguard for cybersecurity: it supports filtering insecure code produced by LLMs, so if you ask a model to generate, say, a phishing attack, it will filter it, which is cool. And again, it's open source, it's on GitHub; it covers everything from insecure code to code interpreter abuse to secure command execution protection, things like that. And then I think I have maybe one or two slides more, and I'll get into some really cool stuff if you let me go over by maybe one or two more minutes. So, torchtune. Anyone heard of torchtune? Anyone use PyTorch? Okay, sweet. torchtune is a labor of love from me and the team: a pure PyTorch fine-tuning library, so no fifteen dependencies, nothing crazy. This is pure PyTorch, as clean as can be, so you can build on top of it, you can use it yourself in Python and PyTorch, it's beautiful. It does full fine-tuning, and it supports Llama 3 out of the box; unfortunately I'm only showing Llama 2 here, but they've actually implemented Llama 3 support and it's absolutely fantastic. It integrates with Hugging Face and a bunch of other libraries. It's awesome; check it out.
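As a taste of that pure-PyTorch flavor, a minimal sketch, assuming torchtune is installed and using its Llama 3 model builder; the API names are as of the Llama 3 launch, so check the torchtune repo for current ones, and note that instantiating the full 8B module takes tens of gigabytes of RAM.

```python
# Minimal sketch, not from the talk: torchtune exposes model builders as
# plain functions that return a vanilla torch.nn.Module, no wrappers.
# Assumes `pip install torchtune` and the API as of the Llama 3 launch.
from torchtune.models.llama3 import llama3_8b

model = llama3_8b()  # randomly initialized; load weights via a checkpointer
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.1f}B parameters")
# Actual fine-tuning runs are driven by the recipes and YAML configs
# that ship with the library (e.g. LoRA fine-tuning on a single device).
```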
There are a ton of resources: go to the Llama 3 GitHub, where we have llama-recipes (you can search for it) with a pile of introductory notebooks, a LangChain RAG demo, prompt engineering, a bunch of stuff. Okay, one more minute please. So, Mark talked about this this morning: we actually have a much larger model in training. It's a big boy, so we wanted to show some metrics here. I think I saw some comparisons on Twitter this morning; you can go compare this to other models out there. This model has not finished training yet, it still has some way to go, but we wanted to show some of the numbers, so we showed both the pre-trained numbers and numbers from a checkpoint we took, ran some pretty basic SFT on, and aligned. You can see MMLU is 86.1, which is, frankly, a ridiculous number; I won't compare it to anyone else's, but you can see where this model is trending. And GSM8K is 94.1, so it is a really, really strong model. Again, training isn't complete, and post-training is going to change a lot, but we wanted to show at least some of these metrics. This is from a checkpoint from about three days ago, so it's very exciting. So, a couple of things on what's to come: obviously we have bigger and better models on the way; we're teasing our models with over 400 billion parameters, and multilingual: we'll support many languages.
I mean, you can imagine: Facebook, our properties, our family of apps have around four billion people. We are everywhere, so multilingual is very important for everything, including Meta AI, and we want to build that into Llama. And also multimodal: you can imagine all the things we're doing in AR/VR with the smart glasses, the Ray-Bans. You need to be able to understand everything around you, and you can't do that with text alone, so multimodal is coming. And of course our commitment to safety: we'll continue to open source our safety tools and a lot of our safety work, to build a community around it and build standardization around safety. It's something I'm very passionate about, so we'll definitely continue to do that. So lastly, I promise: go check it out.
If you really want to play with Llama 3, it's at meta.ai and it's free. You can play with it, you can prompt it, you can have it generate images, like those cool animated images I showed at the start: just click Imagine, then prompt, then click Animate and it will generate them for you. And when you prompt Meta AI, you're actually hitting a model based on Llama 3, which is really cool. I'll pause there. Thank you very much.
