
A data-centric view on reliable generalization - Ludwig Schmidt | Stanford MLSys #71

Mar 23, 2024
Welcome to episode 71 of the Stanford MLSys Seminar Series, associated this quarter with CS 324, the Advances in Foundation Models class. We're joined today by Avanika and Michael, say hello to YouTube, and also Ludwig. Ludwig has a great talk lined up for us today on a data-centric view on reliable generalization, and we're very excited to hear from him. Ludwig is, of course, a professor at the University of Washington, where he's been doing a great job, and he also did great work at Berkeley before that, so we're very excited to have him with us today. Ludwig, go ahead, share your screen and take it away. Great, thank you, and thanks for the invitation and the introduction.
I am very excited to be here today. Let's double-check that this works: you see the one-slide view, not the proper two-slide view? Yeah, it looks good. Okay, great, then I'm ready to start. Cool, thank you all for joining today. I'm going to talk about a data-centric view on reliable generalization. I think everyone agrees with the high-level motivation: over the last ten years there have been really dramatic advances in machine learning on the research side, with a lot of progress across various benchmarks.

The ambition is that we can take this technology and apply it across a wide range of applications in the real world, such as transportation (for example, self-driving cars), various healthcare applications such as X-ray imaging, robotics, online content moderation, and of course in 2023 we also have to add chatbots to this list. This is all great; I think machine learning will have a tremendous positive impact on these domains. But there is always a key challenge when applying machine learning in the real world, and that is that it must be reliable. This is obviously a big area of research right now, with a lot of interesting work, including at Stanford, on robustness, distribution shift, and so on. In this talk I want to step back a little and ask: how can we make machine learning reliable?
I see two broad approaches to this question. One is maybe the more traditional one in machine learning, where we work on our training algorithms, our model architectures, and so on, so that the resulting model, when trained on a given dataset, is more robust to distribution shift, more robust to adversarial examples, and so on. We've seen a lot of interesting advances on that side. But there's also a complementary view that can be a very promising way forward, which is improving the training data: maybe we already have really good algorithms, and we just need to feed better training data into them to make machine learning more reliable. Over the last two or three years my own focus has shifted toward this second view, taking a closer look at the training data, and what I want to do today is tell you a little bit about this journey in the context of computer vision. We'll start with a broad overview of the robustness landscape in computer vision up to the end of 2020, because at the beginning of 2021 OpenAI's CLIP model came out and made super exciting progress on a wide range of robustness benchmarks that previously seemed very, very challenging. Then there are two obvious questions: how can we make models like CLIP, now that they are available, even more robust, and where does all this robustness come from in the first place? This is where the key insight appears: essentially all of CLIP's effective robustness comes from better pre-training data. At the end of the talk I'll say a little about future directions, what's currently happening in the research world around building better pre-training datasets for image-text models.
Before diving in: I really like it when these talks are interactive. With the current format we can't take questions live, but at the end of each section I will pause briefly, and if you have questions, there is a Discord chat where you can send them; I hope the hosts can forward them to me, and then I can try to answer some, and we will have a longer discussion at the end. Okay, great, let's dive into the first part, the overview of the robustness landscape in computer vision. We'll do this in the context of probably my favorite machine learning dataset, which is ImageNet. We're at Stanford, so I assume almost everyone is familiar with ImageNet.
I think ImageNet is still a very, very useful resource, because we have accumulated a lot of experimental infrastructure and knowledge around this dataset, which makes it a really rich test bed for experimenting with new models, testing hypotheses, and so on. Specifically in the context of robustness, ImageNet has always been fertile ground for experimentation, because so many people have instantiated the challenge of making machine learning more robust specifically in the context of ImageNet. There's this widely used ImageNet test set, and models got better and better on it over time, but when you deploy these models in the real world, when you apply them to a new test set, there is often a substantial drop in performance, and even on images that look perfectly fine to humans, the models still get it wrong. To make this more rigorous, in a benchmark-driven evaluation paradigm, several research groups have proposed out-of-distribution test sets for ImageNet. The idea in all of these test sets is that the classes are the same as in ImageNet, so you can take an ImageNet-trained model and evaluate it on these OOD test sets. One example, the first line of work here, is ImageNet V2, which I did during my postdoc at UC Berkeley, and there is a lot of work from various places in this direction as well: ObjectNet from the Tenenbaum group at MIT, ImageNet-Sketch from CMU, and ImageNet-R, for example. They all test different types of distribution shift in the context of ImageNet, and this is obviously just one way of testing robustness; there are also adversarial examples, other types of corruptions, and so on.
At some point around 2020 this got a little confusing, at least from my perspective, because many research groups had proposed many different distribution shifts in the context of ImageNet. So what we did in this paper, towards the end of my postdoc, was a very large evaluation study of distribution shift in computer vision, specifically on ImageNet. The core object we built in this paper is this big evaluation matrix with two dimensions: on the x-axis we have about 200 different distribution shifts, 200 different evaluation settings, and on the y-axis we have about 200 different models in our testbed, so each cell in this matrix corresponds to evaluating one model under one test condition.
In total this corresponds to around one billion image evaluations; Nicholas Carlini, one of our collaborators, ran a lot of these at Google, which was very helpful. Just to give you an idea of what's in this big testbed, let's look at the two axes separately. On the y-axis, the model dimension, it is useful to think of the models in three categories. The first is what we call standard models: models introduced with the sole purpose of making progress on the ImageNet leaderboard, nothing specific about robustness; these are classic architectures like AlexNet, VGG, ResNet, DenseNet, and so on. The second category is what we call robust models: ImageNet models where the authors introduced some modification specifically to make them more robust, think adversarial training, special types of data augmentation, special filtering layers, and so on. The third type, which will be important, is models trained with more data, not just the ImageNet training set but additional data as well. That's the model dimension. On the x-axis, the distribution shift axis, as with the models we tried to be as complete as possible: take everything we could find on GitHub, plug it into this testbed, and see what happens. So we have distribution shifts like ImageNet V2, ObjectNet, and ImageNet-R, which I already mentioned.
Then we have datasets that test robustness on video frames, like ImageNet-Vid-Robust, and obviously we have Lp adversarial examples and image corruptions like ImageNet-C, and so on. You could give an entire talk about this testbed and its findings; today I want to focus on one specific phenomenon we found, and to do that we need to establish a specific way of looking at robustness. This will be in the form of the following scatterplot, and I'll go through it step by step because it's going to be a key part of the talk. On the x-axis you see the accuracy of a model evaluated on the standard ImageNet test set; on the y-axis, the accuracy of the same model evaluated on an out-of-distribution test set such as ImageNet V2. Each point in this plot corresponds to a model in our testbed, and to start I have only plotted the standard models, the blue points. As I mentioned, they range from AlexNet, the foundational model in the lower left corner, through VGG, ResNet, DenseNet, and so on, up to EfficientNet-B7, which was state of the art when we did this evaluation.
One thing to note is that I have drawn the dashed y = x line, where accuracy under the distribution shift equals in-distribution accuracy. What is quite remarkable here is that all of these models follow a very precise linear trend. Why does that happen? Very good question; for now we'll just take it for granted, because it gives us a very useful baseline. If we assume, and as far as we know this is true, that all ImageNet models fall on this line, then if someone tells you, oh, I have a new model that gets 72 percent accuracy on ImageNet, you can look at this red linear trend and say that the expected out-of-distribution accuracy on ImageNet V2 will be about 58 percent. That tells you whether the model is noteworthy or not: if it gets 58 percent, it behaves basically like every other model we've seen before. What we'd like to see is a more robust model that sits above this red baseline on the scatterplot. I've drawn a green star here as a hypothetical robust model, and the rise above the red line is what we call effective robustness. The question is: what kinds of training or data interventions give you effective robustness? Just to clarify that this is a reasonable goal to aim for: in one of our papers at Berkeley we did an evaluation with human labelers to see where humans fall on these scatterplots, and they are basically on the y = x line. Everything below that line is something humans get right but models get wrong, so this is something models should be able to do if we aim for human-like generalization.
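To make the idea concrete, here is a minimal sketch, in Python with made-up numbers, of how effective robustness can be computed: fit a linear trend to a set of baseline models, then measure how far a new model sits above that trend. The baseline values below are hypothetical, not the actual testbed data, and the papers typically fit the trend after a logit-style transform of the accuracies; this sketch shows the simpler linear-in-accuracy version.

```python
import numpy as np

# Hypothetical baseline results: (in-distribution acc, out-of-distribution acc)
# for standard ImageNet models. These numbers are made up for illustration.
baseline_id = np.array([0.57, 0.66, 0.72, 0.76, 0.80, 0.84])
baseline_ood = np.array([0.41, 0.52, 0.58, 0.63, 0.68, 0.73])

# Fit the linear trend y = a * x + b over the baseline models (the "red line").
a, b = np.polyfit(baseline_id, baseline_ood, deg=1)

def effective_robustness(id_acc: float, ood_acc: float) -> float:
    """Rise of a model above the baseline trend at its in-distribution accuracy."""
    predicted_ood = a * id_acc + b
    return ood_acc - predicted_ood

# A hypothetical new model: 72% on ImageNet, 63% on the OOD test set.
print(effective_robustness(0.72, 0.63))  # positive => above the red line
```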
Okay, this is where I normally pause for questions about this robustness framework, so I'll take a sip of water now, and if there are any questions we can discuss them; if not, we'll continue. Yeah, a couple of questions from the class. Some students are interested in, and maybe this is taking a step back a bit, how you arrive at a functional definition of robustness. If we completely changed the test distribution, surely we wouldn't expect the model to succeed, so is there a good quantifiable notion? I know you're talking about effective robustness here. Right, so for me, when we talk about robustness for these computer vision or NLP models, it's a setting where we assume the task should be possible: a human can classify this image, but somehow my model can't. The reference point for me is always: to what extent can a human still classify these images under the distribution shift?
So from a robustness perspective, humans are the right target here, and effective robustness measures how far we are from this ideal y = x line. Depending on the dataset, maybe y = x is impossible, maybe there really is five percent label noise, but in general I would say that, given this baseline, the red line, we would like to lift models above it, and then we say a model is more robust compared to this set of baseline models. Got it, thanks. And actually, just curious about this:
this robustness plot draws an interesting parallel, not exactly the same, to these recent scaling law papers for large language models, where when we scale up, say, the number of parameters, we can predict, say, accuracy. It's super interesting that there is a very strong linear correlation here too. Is there anything we can read off from the individual points, for example points from the same model family scaled up with many more parameters? Yes, these are all very good questions.
Thank you very much, and I'm glad I get to talk about them. First of all, the analogy to scaling laws: I think it's really apt. And maybe this goes back to the previous point about quantifying robustness, because if you take a very pragmatic perspective you could say, well, I don't care about the slightly more conceptual story Ludwig is telling me; I only care about which model gets the highest accuracy on my out-of-distribution test set. I don't care about a baseline, just the highest accuracy, and that is a pretty reasonable point of view. The thing is, from a research perspective there are so many different models you want to compare, and the nice thing about this linear trend is that we can really separate robustness from baseline accuracy: I can talk about the robustness of a ResNet-50 modification independently of the baseline accuracy that the ResNet-50 gets.
I can see that we are making progress and models are getting more accurate on ImageNet V2, but if a model moves above the line, something interesting is happening, because the idea is that if we can take whatever trick makes that model move above the line and combine it with the next-generation model, we should get something even better. So what I want to do with this definition of effective robustness is decouple in-distribution accuracy from robustness and have a notion of robustness that doesn't depend on exactly where you are on the x-axis, much like a scaling law. Basically you want an invariant notion of robustness, so you can compare the robustness of a model with 60 percent accuracy to the robustness of a model with 85 percent accuracy.
Got it, thank you. Excellent questions. Should we continue? I think we can keep going for now, thanks. Okay. I also want to say these are great questions; I'm sure the discussion will take more than 20 minutes, but that's okay, we can always speed up at the end. Okay, so this is the framework of these scatterplots, and the big question is: do any of the robustness interventions that people proposed before 2020 actually help with this type of distribution shift? To measure this, we simply ran the 200 models in our testbed on ImageNet V2 and plotted them all on this graph. The first visual impression is that this looks almost like the previous plot: there is still a very clean linear trend, the red line, and basically all the models are on or very close to that line. To be a little more precise, we have added two types of models: the robustness interventions, the brown dots, and the models trained with more data, the green dots. Especially in the higher-accuracy regime, all the brown dots are basically on the line, so they don't help with the ImageNet V2 distribution shift.
The only points that help a little and sit slightly above the red line are the models trained with more data, so I've highlighted them here at the top. We have a Google model trained on 300 million additional images, the internal JFT-300M dataset; we have a couple of Facebook models trained on a billion Instagram images; and we have a model trained on ImageNet-21k, the full version of ImageNet, not just the 2012 competition dataset. Just to put this in perspective, ImageNet is about a million training images, so the Instagram dataset is a thousand times more data, and it does move off the red line, but only by one or two percentage points. It's a gain, but a modest one. So far so good. Also, to be fair to the robustness interventions: they do help for the type of distribution shift they were designed for; for example, adversarially robust models do help with adversarial robustness. The problem is simply that they only help in the specific setting they were designed for and do not make the model more broadly robust to real-world distribution shifts. But all of this so far is in the context of ImageNet V2, and one worry would be: okay, maybe this is just this one dataset.
Maybe it's a fluke, so what happens on other datasets? Let's look at one more example from a different dataset: ObjectNet, which was built by a different group. I was not involved in ObjectNet in any way, and I think it's a really good paper. They also wanted to create a distribution shift dataset for ImageNet, but now specifically to challenge ImageNet models. ImageNet V2 had a different motivation: we originally wanted to build a new test set that was very similar to ImageNet, and it turned out that even if you do that, the models are still fragile. ObjectNet takes standard object classes from ImageNet, like a chair, and puts them in unusual poses, like upside down, or in unusual contexts, like a kitchen chair that is now in the bathroom, to break some of the correlations that you have in datasets like ImageNet. I thought, okay, this is cool, this is exactly the type of dataset we should experiment with, so we took ObjectNet, plugged it into our testbed, and got the scatterplot here. Again, the first visual impression is that this looks pretty similar to what we've seen before.
One big difference now is that the gap to the dashed y = x line is larger than it was for ImageNet V2, but overall all the models are still pretty close to a linear trend, and the only models above the red line are again some of the more-data models we highlighted earlier, the models from Google and Facebook trained on substantially more data. There are many more datasets we could talk about; in the interest of moving the talk along, I'll just say briefly that this is not specific to ImageNet classification. Other groups, and some of our other papers, have done similar distribution shift studies in settings such as MRI reconstruction, pose estimation, and object detection for robotics, and in general these linear trends under distribution shift are quite common across computer vision problems. We also looked beyond computer vision at NLP, specifically the SQuAD dataset, which we'll stick with today: we built four distribution shift test sets for SQuAD, and you generally also get these nice linear trends under distribution shift. So that is the high-level message here.
Also, the finding that robustness interventions don't help much under real-world distribution shift is something other work has reported as well. For example, there is a very good paper in the context of domain generalization; quoting the abstract, they ran extensive experiments using DomainBed, their new testbed, and found that carefully implemented empirical risk minimization achieves state-of-the-art performance across all datasets. That is a similar story to what we found: the standard models, the blue data points, do as well as or better than the explicit robustness interventions. So one thing revealed here is that ERM is actually a really strong baseline. And just FYI, if you're interested in this linear trend phenomenon, we have a paper where we study in much more detail when it happens and when it doesn't. For the purposes of this talk, I want to come back to the two ways of reading the result so far. On the one hand, the more-data models are above the red line, but overall the effect is pretty small: the Instagram model has a thousand times more data and we only get one or two percentage points. You can do a back-of-the-envelope calculation and conclude that to close the gap we would need many orders of magnitude more data, and that's just a lot of data. So by the end of 2020 I was a little skeptical about this.
I wasn't quite sure where the progress would come from, but then OpenAI released their CLIP model and made tremendous progress on all of these benchmarks. That's the second part of the talk. Again, let me take a brief pause here for questions before we continue. Yeah, there are many, many questions from students. Can I try to get through the first ones? When did we start, by the way? I'm not completely sure; actually, at 3:30. Oh okay, so it's already been 20 minutes. Let's maybe take one question now and leave the rest of the discussion for the end. Sounds good. So I'll ask something that I think is timely here.
As you mentioned, among all these different kinds of interventions, only more data seems to help. Do you have a quick intuition for why, or for what the criteria are that make more data useful? Yes, this is a great question. To be precise about it, it's not just the quantity of data but the quality, in a sense I would like to be able to specify more precisely but currently can't. For example, and this is actually a good point, if you just take ImageNet and subsample the training set by a factor of 2, 4, 8, 16, 32 as a natural experiment, training on ImageNet but with fewer images, you move exactly along this red line. When we first saw a model above the line after adding more data, we were very excited, and we expected that subsampling the data would push models below the line, but in fact they move exactly along the line. So this red line is really a property of your training distribution, not a property of how many samples you have from that distribution. To go above the line you need a more diverse distribution, for some notion of diversity that I don't know how to make more precise right now. Perfect, thanks, brilliant.
Well, let's dive into the second part, about CLIP and the robustness findings. CLIP came out in early 2021; many of you have seen the paper, and this is one of the first results figures from the blog post. Personally, I was blown away, because I had been looking at these out-of-distribution test sets for quite some time. First of all, I was very happy that they used exactly our test sets to evaluate robustness, and then I was very impressed, because they do exactly this effective robustness evaluation here: they compare their largest CLIP ViT-L with a ResNet-101 trained on ImageNet, and they get plus six percentage points on ImageNet V2, which is about half the drop, plus 50 on ImageNet-R, plus 40 on ObjectNet, plus 35 on ImageNet-Sketch, and plus 74 on ImageNet-A. These are very big gains: if you follow the ImageNet-related literature, people happily write papers about small single-digit gains, and double-digit gains are basically unheard of, let alone something in this range. So I thought, okay.
Great, now that we have CLIP, I need to better understand what's going on here, because one really interesting thing about CLIP is that nothing about this model was explicitly designed for robustness. Let's quickly recap how CLIP works. The training set is 400 million image-text pairs collected from the web, so we have images with associated captions. The model architecture is the following: you have a text encoder, a Transformer tower, that maps the caption into an embedding space; you do the same with a separate image encoder on the image side; and then you have a contrastive loss that aims to map the embeddings of matching image-text pairs close together and the embeddings of image-text pairs that don't go together far apart. These are classic contrastive losses, as we've seen for example in SimCLR, but now moved to the multimodal domain. Then you apply the standard large-scale machine learning paradigm: train this for a couple of weeks on hundreds of GPUs with a model that is small by NLP standards but decent-sized by computer vision standards, around 300 million parameters. Okay, so that's how the model is trained, and then you get something really cool out of it.
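Before moving on, here is a minimal sketch of the kind of symmetric contrastive objective just described. This is an illustration of the idea, not OpenAI's actual implementation: the image and text encoders are assumed to exist elsewhere, and the temperature is fixed here for simplicity, whereas CLIP learns it.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    image_emb, text_emb: [batch, dim] outputs of the (assumed) image and text
    encoders, where row i of each tensor comes from the same image-text pair.
    """
    # Normalize embeddings so the similarity is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i is caption i (the diagonal).
    targets = torch.arange(image_emb.shape[0], device=image_emb.device)

    # Pull matching pairs together, push mismatched pairs apart, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```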
One important feature is that you can do what's called zero-shot inference. If I have a new classification problem, I can use the actual class names as classification targets. Say I want to classify between airplane, dog, and bird: I feed those class names through the text encoder, so I now have an embedding for each of these text strings, and when a new image comes in, I check which of the text embeddings the image embedding is closest to and take the nearest one as my answer to the classification problem. This is really powerful because it makes these models very flexible to use.
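As a rough illustration, here is what zero-shot classification looks like with the OpenCLIP library mentioned later in the talk. Treat this as a sketch rather than exact usage: the model name and pretrained tag follow the pattern in the OpenCLIP README but may differ from what is currently available, and the image file is a hypothetical placeholder.

```python
import torch
import open_clip
from PIL import Image

# Example model/tag in the style of the OpenCLIP repo; available names may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

class_names = ["airplane", "dog", "bird"]
prompts = [f"a photo of a {c}" for c in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical file
text = tokenizer(prompts)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    image_emb /= image_emb.norm(dim=-1, keepdim=True)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)
    # Pick the class whose text embedding is closest to the image embedding.
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```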
You don't need to first build a fine-tuning dataset; you can just work with the textual descriptions. And the interesting thing is that when you do zero-shot inference you get this remarkable robustness behavior. This is the same kind of scatterplot we've seen before, straight out of the CLIP paper's robustness evaluation: the x-axis is ImageNet accuracy as before, and the y-axis is, pragmatically, the average over seven of the out-of-distribution datasets we've talked about, like ImageNet V2 and ObjectNet. The blue line is the baseline of models trained only on ImageNet, the green dots are the more-data models from before, and the orange line is zero-shot CLIP. You can see that it basically closes half of the robustness gap; we don't know whether you can actually get to y = x here, but the obvious question about the orange line is where all this robustness comes from. Another obvious question concerns fine-tuning, because there is one more line here, the red line. The red line is no longer the baseline; it is for CLIP models that are fine-tuned on ImageNet. They took CLIP, fine-tuned the last layer, and then the following happened: the models improve on ImageNet.
I mean, if you fine-tune on ImageNet you should get better on ImageNet, but they actually lose some robustness: they move down on the y-axis. The hope would be that fine-tuning on ImageNet moves you roughly along the orange line, but that's not at all what happens; you move in the right direction on the x-axis but downward on the y-axis. It's not in this plot, it will appear in a couple of slides, but this was a real puzzle when CLIP came out: the zero-shot models were good but not state of the art, and when you fine-tune them, the fine-tuning erases a lot of the robustness. This was an important observation that also put some of the earlier more-data results into context.
We had models that were trained by Facebook on a billion images, but the only way we ever accessed those models was after fine-tuning on ImageNet, and it turns out that fine-tuning on ImageNet erases a lot of the good robustness properties. An obvious question here is: can we fix this? Can we get the best of both worlds, in-distribution and out-of-distribution accuracy? The answer is yes; we wrote a paper on that, robust fine-tuning of zero-shot models. It was a really fun collaboration with several co-authors, including collaborators from Columbia and OpenAI. And the answer to "can we do this?"
is embarrassingly simple. Let's recap the problem: we have the zero-shot models here, and fine-tuning moves them in the right direction on the x-axis but down on the y-axis. How can you fix this? The solution is literally a line that interpolates between the zero-shot model and the fine-tuned model. And when I say linear interpolation, I really mean you take the network weights and interpolate them linearly. This was very surprising to me, because the whole point of a neural network is that it is a non-linear model, so why would you linearly interpolate a non-linear model?
There are reasons from the loss-landscape and mode-connectivity literature for why this was worth trying; it was a connection Gabriel knew very well from the literature and was inspired by, and it worked very, very well. If you do this linear interpolation between the zero-shot model and a fine-tuned model with interpolation coefficient alpha, you can plot the resulting curve, and the nice thing about this interpolation curve is that it first moves up and then to the left. It's not a straight line between the star and the square; the points in the middle combine the best in-distribution and out-of-distribution accuracy. Tuning alpha is not a big problem either: in basically all our experiments, alpha = 0.5, just taking half of each model, is almost as good as the best possible alpha.
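Here is a rough sketch, in PyTorch, of the weight-space interpolation just described. It assumes the zero-shot and fine-tuned models share the same architecture and state_dict keys; it is meant as an illustration, not the authors' reference implementation.

```python
import copy
import torch

def interpolate_weights(zero_shot_model, finetuned_model, alpha=0.5):
    """Return a model whose weights are (1 - alpha) * zero-shot + alpha * fine-tuned.

    Assumes both models have identical architectures / state_dict keys.
    Non-float buffers (e.g. integer counters) may need special handling.
    """
    zs_state = zero_shot_model.state_dict()
    ft_state = finetuned_model.state_dict()

    merged = copy.deepcopy(zero_shot_model)
    merged_state = {
        key: (1 - alpha) * zs_state[key] + alpha * ft_state[key]
        for key in zs_state
    }
    merged.load_state_dict(merged_state)
    return merged

# Sweeping alpha from 0 (pure zero-shot) to 1 (pure fine-tuned) traces out the
# interpolation curve from the talk; alpha = 0.5 is reported as a strong default.
```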
So that was a schematic illustration of how the method works; here is the real-data version of these plots. We have two fine-tuning endpoints, the square and the diamond: the square is end-to-end fine-tuning, the diamond is last-layer fine-tuning. As I mentioned, end-to-end fine-tuning actually moves down further in effective robustness, but the nice thing is that interpolating with the end-to-end fine-tuned model gives you a substantially better interpolation curve. Overall, depending on where you start as your baseline, the diamond or the square, you gain five to nine percentage points under the ImageNet distribution shifts without losing in-distribution accuracy, and this also works on several other datasets. The story so far is ImageNet-centric because we have these very rich baselines and evaluation benchmarks there, but the method behaves the same elsewhere, and at the time it gave us state-of-the-art accuracy on WILDS benchmarks, among many other datasets we looked at. We also have various extensions of this; one of them is called model soups, a follow-up paper where we use this idea to improve fine-tuning even if you ignore robustness. It basically asks: why stop at averaging two checkpoints when you can average three, four, or five? We call that a model soup, and it gave state-of-the-art in-distribution accuracy on ImageNet.
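Extending the two-model interpolation above, a uniform model soup just averages the weights of several fine-tuned checkpoints. Here is a minimal sketch under the same assumptions (identical architectures; the checkpoint list in the usage comment is hypothetical):

```python
import copy

def uniform_model_soup(models):
    """Average the weights of several fine-tuned models with the same architecture."""
    assert len(models) > 0
    state_dicts = [m.state_dict() for m in models]

    soup = copy.deepcopy(models[0])
    soup_state = {
        # Non-float buffers may need casting back to their original dtype.
        key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
        for key in state_dicts[0]
    }
    soup.load_state_dict(soup_state)
    return soup

# e.g. soup = uniform_model_soup([finetuned_run_1, finetuned_run_2, finetuned_run_3])
# where the runs are hypothetical fine-tunes with different hyperparameters.
```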
One other nice thing we did recently is that we trained the most accurate public CLIP model, about 80 percent zero-shot accuracy on ImageNet, whereas the best OpenAI checkpoint gets around 75 to 76 percent. This is very nice: with the OpenCLIP repository we built, we can now train better models than OpenAI's public checkpoints. And again, since we want to understand not just computer vision but machine learning more broadly, we're also doing similar kinds of evaluations in NLP. This is something we did at the end of last year: 350 different models on 16 different question-answering datasets, with SQuAD, the reference dataset, on the x-axis and the average over the other 15 datasets on the y-axis. You also see these nice linear trends, and you also see that zero-shot inference is the most robust way to do inference. Again, there's much more in this paper that I unfortunately don't have time to cover today. In the interest of making progress, let's go to the third part, where the robustness comes from, but maybe people have a couple of questions about part two in the meantime. Yeah, there's a ton of activity in the class Discord chat. One question people were interested in was trying to disentangle the contributions to CLIP's robustness: you have this multimodal architecture, the text encoder, the image encoder, the contrastive loss, so you have those components, but then you also have all this internet-scale data. Have you looked into ablations of these factors?
Well, this is a great question, thank you, because that is exactly what the third part is about; it is basically all ablations. We wrote this paper to study very rigorously where CLIP's robustness comes from, and the conclusion is very clear. From my perspective it shows that data is what matters for robustness, because you can pin this down and say quite precisely that all of the effective robustness comes from the pre-training distribution. This is the paper, from ICML last year: "Data Determines Distributional Robustness in Contrastive Language-Image Pre-training (CLIP)." We did exactly what the title suggests: we went through a lot of different hypotheses and tested them one by one. Let's go over them to make sure we're all on the same page and cover everything. The first is language supervision: standard supervised ImageNet training has nothing multimodal, it's just images with categorical labels, so now that CLIP incorporates language, maybe that gives you the big robustness boost; I think this was actually one of the main hypotheses. The training distribution, I think, was the other big hypothesis: if we go from ImageNet to this more diverse web-crawled pre-training dataset, maybe that is what gives you the robustness. I put question marks here because unfortunately CLIP's training set is not public; they released model checkpoints, and we know some things about the training set, but it is not public. Next is the size of the training set: this could be a reason, 400 million images for CLIP versus 1.2 million for standard ImageNet training. However, as I mentioned earlier, subsampling or growing your training set while keeping the same distribution moves you exactly along the line, but I list it for completeness. Then there's the loss function: CLIP's is contrastive, supervised classification is different. There are CLIP's test-time prompts: CLIP uses an ensemble of 80 prompts, so it's not unreasonable to say that maybe the robustness, or at least some of it, comes from really clever prompt engineering, whereas a plain ImageNet model has no prompting capability at all. And then the model architecture, which you mentioned before: the best CLIP models are ViTs, while most of the models we looked at for ImageNet are convolutional networks. In the paper we examine each of these; for the sake of time, here we'll look at the key experiment for disentangling the effect of language supervision from the effect of the training distribution. Conceptually you can think of it as two axes: one axis is the training distribution, ImageNet versus YFCC; the other axis is whether you use language supervision or not. Just for context, what is YFCC? YFCC is the Yahoo Flickr Creative Commons dataset, a public image-text dataset. We know that YFCC is a subset of CLIP's training set, since OpenAI released the subset of YFCC that appears in their training set, and based on the experiments we did, it is representative of their overall training set in terms of robustness, so we can use YFCC as a proxy for the actual CLIP training distribution. The other dimension is a standard classification loss versus CLIP's multimodal contrastive loss with language supervision. Before our paper, two cells of this two-by-two grid were covered: standard ImageNet training, obviously, with lots of models, and CLIP models trained on YFCC with language supervision. But if you want to disentangle the factors, you need the other two cells of the grid, so you can run a controlled experiment and vary only one factor at a time. That's what we do in the paper: we run an experiment where we train on ImageNet with captions, and an experiment where we train on YFCC in a supervised way. Let me tell you a little bit about how we did this.
How did we get captions for the images? Taking a step back, the perfect experiment would be if someone gave you the original captions that people wrote when uploading their images to the internet, the images that were later scraped to build the dataset, but that connection is lost: ImageNet is just a dataset of images with class labels, and we didn't know what the original captions were or where to get text data from. There were a couple of options we considered. You could just create templates like "a photo of a ..." followed by the class name; that works, but the problem is that this text is cleaner than, and doesn't correspond to, the actual language distribution you see in a dataset like YFCC. The next option would be to run an image captioning model to get more diverse captions, but now the captioning model itself is a confounder, which is also not as clean as you would like. You could collect new annotations from humans, which is closer to genuine human-generated language, but that is more expensive and still not quite right, because it's not the text people originally uploaded. So the question became: can we in fact recover the original text annotations, the original captions, for ImageNet? And the answer is yes, because from the days when we built ImageNet V2 we knew that about half of the original ImageNet training set actually comes from Flickr. So somewhere on Flickr are these images, together with the captions people wrote when they uploaded them; we just didn't know how to make the connection.
We dug into the history of ImageNet, found the relevant metadata, and did some matching against Flickr, and to summarize, for about half a million images, roughly half of the ImageNet training set, we were able to connect them to the original image on Flickr. That allowed us to build a new dataset we call ImageNet-Captions: 500,000 ImageNet images with their original Flickr captions, and we could then train CLIP models on it to see the impact on robustness when you add text annotations to ImageNet. Just to give you a couple of ImageNet-Captions examples: for each image we actually combine different kinds of text, the title, the description, and the user-provided tags. All of this is a bit noisy, but the overall quality level is comparable to YFCC. Then we could run the caption experiments. We have the blue line here, the baseline of ImageNet-trained models, and we have the orange line, the OpenAI CLIP models, which have effective robustness; the big question is where the CLIP models trained on ImageNet-Captions land. For context, if you train a CLIP model on YFCC, you land on the purple line, with similar effective robustness but in a lower-accuracy regime. We then trained CLIP models on ImageNet-Captions, and these are the green hexagons here, and the point is that the green hexagons sit exactly on the blue line: adding language supervision to ImageNet gives you essentially no extra robustness. We did several variations of this experiment; for example, we also tried using CLIP's ViT-L text tower, trained on their text data, as an initialization when training our ImageNet-Captions CLIP models, but none of these things mattered; the models always stayed on the same effective robustness line as the ImageNet-only models. As I mentioned, we did more experiments, but in the interest of time
we can only touch on this one here. Basically, after going through all these ablations, the only thing left standing was the training distribution: you can do similar ablation studies for the contrastive loss or for the model architecture, and none of these things affects robustness; the only thing that matters is the training distribution. Cool. A quick time check: how are we doing? We have about five to fifteen minutes left, and that includes the Q&A. Okay, then let me wrap up the remaining material in a couple of minutes and we'll take questions at the end. So there's the obvious next step: can you design better pre-training datasets? This is a paper we had last year where we looked at how much different web data sources really vary in the robustness they induce.
For example, we assembled a collection of different web data sources, RedCaps, Shutterstock, YFCC, LAION, Conceptual Captions, and Wikipedia, and ran these robustness evaluations, and basically what you see is that it's all over the place: these different training sources really vary a lot in the robustness they induce when you train a CLIP model on them. There's more to say here, but in the interest of time, the next obvious question from these results is how to reach higher accuracy in the first place. So far I showed you lower-accuracy CLIP models, which are interesting because we get reliable scaling behavior and effective robustness, but in the end we want high-accuracy, open-source CLIP models. To do this, I and others in an open-source collaboration built a large public image-text dataset, LAION-5B. This is what we train our state-of-the-art CLIP models on, and it's also what a lot of the recent generative models are trained on; Stable Diffusion, for example, was trained on our LAION dataset. I think it's good to keep this in mind when we talk about the importance of data in machine learning: the common foundation for all of these advances, much more robust models and this amazing progress in text-guided image generation, is really these big datasets collected from the web. So we built LAION-5B, and this is cool, but we don't have much time, so I'll just point you to the paper for how we built it. The important part here is that there is still room for improvement, because when you compare LAION-trained CLIP models with OpenAI-trained CLIP models, at the smallest scale, like a ViT-B/32 model, we basically match what OpenAI did, but as you increase the model scale there is this unfortunate phenomenon that our dataset doesn't scale as well as OpenAI's, so we end up with a gap of about three percentage points at the ViT-L scale. The obvious question is: can you build a better pre-training set?
Here, from another recent paper, is a scaling-law comparison of the OpenAI pre-training dataset, the blue line, and LAION, the orange line. Overall I think it's amazing that we have LAION, it's a super useful resource, but the dataset side is something where there's still a lot of room for improvement, and we have some really cool things cooking, but that's material for another talk because it's not finished yet. For now, let's wrap up so we can move on to the questions and discussion. The high-level points: OpenAI's CLIP model led to very large robustness gains in image classification, and the main reason for CLIP's robustness is its pre-training distribution. It's not just about scale, not just that they have 400 million images; there is something inherent in the data distribution being more diverse, in a way I would like to quantify better, but for now we have to leave it at "something more diverse." Also, to be fair to language supervision: in terms of robustness, I said that language supervision is not responsible for the robustness gains.
In general, though, language supervision for assembling training data is a very, very powerful tool. The only reason they could create this much more diverse pre-training set in the first place is that CLIP lets you train on loosely supervised data via language supervision: you no longer have to worry about Mechanical Turk labels, you can just crawl data, and CLIP can train on it as long as there is some text associated with the image. I think this is actually a big step forward in how we can train large-scale models. And, as we saw, different potential data sources for CLIP differ a lot in the robustness they induce. So I think there is now a big question,
actually one of the biggest questions in machine learning right now: how can we build training sets that yield broadly reliable models? This is not just a question for the CLIP world; it also applies to diffusion models and large language models. So that's all I had. Thank you very much for listening. If you want to learn more, I mentioned a couple of papers throughout the talk; we have our OpenCLIP repository for training and evaluating CLIP models; and if you like scatterplots, you can go to robustness.imagenetv2.org and create scatterplots for any pair of test conditions in our 200-by-200 grid, so that's on the order of 20,000 scatterplots, which I hope keeps you busy for a while. For now, I'm happy to take questions.
Thank you very much, Ludwig. I'm going to stop the screen share so that for this part of the discussion we can see everyone's faces. That was a really great talk; thank you so much for coming. Obviously LAION has led to so many great advances in the last year, with Stable Diffusion and the explosion of those models. I was curious whether you have any thoughts about what the equivalent looks like on the language side. One of the things we hear a lot about in language models is these mysterious emergent properties that come with model scale and things of that nature. You've shown that with CLIP the robustness properties actually had more to do with the data than, say, the architecture or the scale; along those lines, do you think there are hidden things lurking on the data side in all those language models? Have you thought about that?
Have you thought about that? it might seem like, I mean, it's something that I'm definitely looking forward to exploring more in terms of hidden things lurking. I think we really don't know how much room for improvement there is on the data side because I think If you think about it, we have thousands of papers and thousands of ablation studies to test all kinds ofinnovations on the modeling side, like each article of a new architecture that are very good application studies and we understand approximately how much we are going to do. win by changing the optimization, uh, how much are we going to win by adding more layers and then for the training data sets, we don't have any of that?
Training datasets right now are more like: hundreds of terabytes of data get spun up, and then we don't have ablation studies for all the different design choices that led to that specific dataset and how they affect downstream generalization. I only really appreciated, after working on ImageNet V2, how many design choices go into making a pre-training dataset, and I think exploring that space is going to matter. There are certain things I can already tell you concretely: we can already improve over LAION by several percentage points, and at matched data and model scale we now have better models than OpenAI by making improvements on the data side. So yes, there is definitely room for improvement.
Right now this is for image-text and CLIP, but I would guess a similar story holds for language, though that's something we still need to understand better. You mentioned emergent phenomena, and this is not directly my area of expertise; I can mostly share anecdotes from talking to colleagues. One thing I've heard is that some of the emergent phenomena become better behaved if you plot them with different metrics; rather than accuracy, plot log loss or something, and then all of a sudden it looks less like an S-curve and more like something pleasantly continuous.
I don't know to what extent that is really understood. On the image-text side, I haven't seen anything like discontinuous or surprising jumps in behavior. Thank you very much. Michael, I know there are a lot of questions on Discord, so I'll leave the next few minutes to you; pick the ones you want, and hopefully we can bring some of that Discord discussion over to YouTube. Go ahead, Michael and Avanika, with some of those questions. Yes, thank you. So, a different kind of curiosity:
how much do you think, Ludwig, this notion of having multimodal data matters, versus just having a lot of images or a lot of text? Another way to phrase it: do you think the combination of text and images could also help improve, say, language modeling? Is the combination itself important, beyond just having a large and diverse amount of image data or a large and diverse amount of text data? Yes, that's a great question. I think it depends a little bit on exactly what you're
evaluating. For the robustness on these ImageNet test sets, it appears that language supervision does not add robustness; if you had the same set of images with categorical labels, you'd probably get the same robustness. We actually ran that experiment in the "Data Determines Distributional Robustness" paper I talked about: we also have the other quadrant, where we take YFCC, the image-text dataset, throw out all the text, and train on the images alone with self-supervision, and you basically get almost all of the effective robustness that you get from CLIP training on YFCC. So that's one perspective. Having said that, there are also papers, a colleague shared one just yesterday or the day before, where they found that multimodal data did help, and this is a different setup.
It's not the ImageNet robustness setting, but in that different setup the multimodal data apparently did help. I think this is something the field is still actively figuring out: for which kinds of downstream problems multimodal data helps more or less. For the settings I've looked at closely, it seems more important to have a diverse image distribution than a multimodal one, but being able to train on multimodal data makes it much easier to assemble a diverse image distribution. Excellent, thanks. Any questions from YouTube?
I'll take over. There hasn't been much activity on YouTube today, but one thing I was curious about, Ludwig: taking a step back and looking at your broader research agenda, could you give us a little preview of what's next? I know you've been busy training a lot of OpenCLIP models, and I'm sure there are a lot of exciting things coming up, but what else can we expect over the next year or two? Do you think there will be a lot more understanding of the data?
Oh, I think there will definitely be new versions of LAION with improved annotations and better downstream models. Based on what we're currently working on, there is still substantial room for improvement on the data side for multimodal image-text models, which will immediately lead to better CLIP models, and probably the next Stable Diffusion version will also be trained on better pre-training data. Right now this is focused on CLIP, because we can evaluate CLIP models very rigorously, and I think that helps ground all of these experiments. But it's totally true that a big consumer of LAION is image generation, and on that side we may need a bit more work on how to evaluate text-to-image models.
I think once we have that, we'll be able to use it as a guide for dataset design as well. But one thing I'm personally most excited about over the next year or two is improving the open-source data infrastructure for machine learning, the infrastructure for building these big datasets that we train models on. Yes, I'm looking forward to that too. I love the open-source community; it's very welcoming, it's super fun, and it's been a total pleasure working with the LAION crew. There's a lot of energy, a lot of great ideas, super smart people, so I'm pretty optimistic. Yeah, great.
With that we're getting close to the end of the hour, so I just want to thank Ludwig again for coming today and giving a really great talk. I think it's really cool for us to see the full arc from the ImageNet V2 days all the way to LAION and OpenCLIP, and the way you present it, the path from A to B is really clear and really inspiring for all of us. In terms of the seminar and the class, we'll be back next Wednesday with Colin Raffel from UNC and Hugging Face; he's going to give a talk on building ML models like open-source software, obviously something very near and dear to our hearts and very relevant to today's talk. With that, we hope to see you all next week, and for now we're saying goodbye to YouTube. If people have questions, feel free to email us and we'll be happy to answer more.
