
Google Engineers On Learnings From Building Gemini

May 14, 2024
Well, hello everyone. I'm James Rubin. I'll be moderating this fireside chat with my esteemed Google colleagues, Peter Grabowski and Peter Danenberg. We're going to touch on the key blockers and solutions for enterprise-ready LLM production and talk about some of the more modern approaches to solving those challenges. We want to focus on the practical applications of today, and we hope that this is really a starting point, a topic of conversation, for you and for us. We'll be in the media lab later if you have any more in-depth questions, but before we dig in, let's start with some introductions.
I can go first. I'm James Rubin. I'm a product manager at Google on the Gemini Applied Research team, where I work with Peter Grabowski as his product counterpart. Before Google, I was at Amazon as a PM for the better part of three years, working across the AI stack: I started with Zoox, their autonomous driving subsidiary, and also worked on custom AI chips and machine learning services on AWS. With that, I'll hand it over to Peter Danenberg. Hi, I'm Peter Danenberg. I work on Gemini, which was previously Bard, and before that Assistant. I've been a senior SWE for a while, and I'm currently working on Gemini extensions. And my name is Peter Grabowski.
I am lucky to work with James on the Gemini Applied Research team. I've been at Google for about 10 years; I came in through the acquisition of Nest, then spent some time working on Google Assistant. In the evenings I work as a teacher: I'm in UC Berkeley's graduate data science program, where I teach deep learning with natural language processing. And one thing I'll add there, because Peter is too humble to say it: during the height of the AI craze in early 2023, he created and taught an LLM boot camp that has really become the essential course at Google, taken by tens of thousands of Googlers, including me. So we definitely have two incredible experts on stage today.
Before we dive in, I want to discuss the motivation for this talk. A recent Andreessen Horowitz survey showed that the vast majority of companies adopting AI choose to build AI applications internally on top of common foundation models, as opposed to buying the B2B AI software available on the market today. There may be a founder in the audience disrupting that trend, but regardless, what we're most interested in is the disconnect between the enthusiasm to build and the hesitancy and slowness we're seeing in companies when it comes to deploying LLMs externally in production. What we are seeing is testing.
We're seeing use cases for employee productivity, but when it comes to external applications, they're comparatively left behind. So we really want to keep that in focus today, and to focus on ways that we can make the path to production for these LLMs a little clearer. With that, I thought we should level-set with everyone and maybe start with the basics. Peter G, how would you describe an LLM, and why should companies be excited about them? So I'll give a really simple example, but I hope it has strong pedagogical value.
LLMs are really nothing more than fancy autocomplete. You may have heard that metaphor before. If I gave the audience a prompt, "I went to see a baseball game last night, I went to see the Boston Red...", hopefully everyone is thinking "Sox." That's essentially what LLMs are doing. Now, with short context windows or just a bit of data, that's not that interesting, but when you give these models billions of parameters to learn with and start showing them hundreds of thousands or millions or billions of examples with increasingly long context windows, you start to see some really interesting behavior emerge. And I think one thing we might want to expand on a bit is the fact that LLMs are not only good at predicting the next word in chat; they can also be used for a wide range of traditional ML tasks. So maybe let's break that down and also talk about ways that tasks can be reframed to work with LLMs. Yeah, that's one of the things that's super interesting: once you get into a parameter size of a couple billion, you start to see these fascinating properties emerge. If you've heard the terms zero-shot or few-shot learning, that's what people are talking about. If anyone in the audience has a two-year-old, or is an ML engineer, this is the answer to a question you might have had: why can I show my two-year-old a photo of a zebra, maybe two or three photos, and on the fourth they can correctly identify what a zebra is, whereas with more traditional machine learning techniques you needed to show 10, 20, or 30,000 images of a zebra? I think that's what James is referring to.
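To make the few-shot idea concrete, here is a minimal sketch of sentiment classification framed as next-word prediction. The `generate` helper is a hypothetical stand-in for whatever LLM completion call you use; it is an assumption, not a specific product API.

```python
# Minimal sketch: sentiment classification reframed as next-word prediction.
# `generate(prompt)` is a hypothetical stand-in for any LLM text-completion call.

def classify_sentiment(sentence: str, generate) -> str:
    # A few labeled examples ("few-shot") followed by the sentence to classify.
    prompt = (
        "Sentence: The service was wonderful and the food arrived hot.\n"
        "Sentiment: positive\n"
        "Sentence: My flight was delayed three hours with no explanation.\n"
        "Sentiment: negative\n"
        f"Sentence: {sentence}\n"
        "Sentiment:"
    )
    # The model simply predicts the next word, which doubles as our class label.
    return generate(prompt).strip().lower()
```

The same pattern works zero-shot by dropping the examples, typically at some cost in reliability.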
What's really interesting is that you can start to reframe a lot of these traditional ML problem framings as word-prediction problems. If I give the model a sentence and then the prompt "the sentiment of this sentence is ___", it will happily complete it with positive, negative, neutral, happy, or sad, and suddenly you've reformulated a classification problem into a next-word prediction problem. Awesome, so there's basically no excuse for businesses not to think creatively about how to apply LLMs to their use cases; there's a lot of solution space to explore. Peter G, I want to get your take on this, especially how LLMs compare to these traditional machine learning approaches in terms of performance and how people use them. So, one of the things we do every two weeks is bring startups to Google to ask how they use AI, what difficulties they have with Gemini, this and that. A couple of weeks ago I found out that a lot of startups are doing this thing where they basically train a bunch of baby models, something like a Gemma 2B model, on things like classification tasks so they can go to market in something like six to eight weeks, whereas before, even training a trivial model like a classifier would take six to twelve months. So we've been seeing this rapid time to market using baby LLMs as classification engines trained on maybe dozens of examples, which is incredible. And actually this is a great way to recover from that slight digression: what is the change in LLMs that makes them especially attractive to enterprises? Yeah, that's one of the things that my team and I are really excited about: all of a sudden, to train a classifier, to train a model of a fixed quality, the amount of time it takes, the amount of data that's needed, the amount of expertise that's needed, the amount of compute that's needed has decreased dramatically. So that's one of the things we are absolutely exploring on our team. And I think there's actually an important thing we missed, and that's customization, right: the ability to tune and align models for a specific task or domain.
Companies have vertical use cases, specific customer problems that they're trying to solve, so this is incredibly important, and I actually want to dig a little deeper into it. The data shows that customization is one of the two main selection criteria for companies selecting a model provider, but the process of performing customization is very complex: there are many different tuning techniques, there are many trade-offs between quality and cost, and it's very difficult to reach the outcome you want. So maybe, starting with Peter G, give companies a starting point to navigate this complexity. Yeah, 100%. I've seen this in my own work.
I worked on Google Assistant for several years, and one of the things we focused on was building a sentence simplification engine for kids. If you ask why the sky is blue and you're an adult, you might get an answer like refraction in the ionosphere, but if you're a kid, that's not a satisfactory answer; you want something like "it bounces off drops of water that are in the sky." So I spent, like you're saying, six to eight months trying to build a model with Google Research and release it to production. We were able to build something that worked, but it wasn't high enough quality to ship. Fast forward to a year ago: with the few-shot techniques we were talking about,
I was able to build something that blew the model we had built five years before out of the water. So, to continue, that's the advice I would give to companies: think about the problem you want to focus on and solve using a big model, and then you can start by just asking a model a question, like we were talking about a moment ago. So let's say you've set up a sandbox, you've run an internal pilot, you have your metrics set up and your data collection; it works great on general tasks, but you're not quite ready to specialize it in your domain, you're not ready to replace some of your domain-specific workflows. What are some more advanced approaches companies can take now?
Yeah, and you mentioned one thing that I think is really important to emphasize, which is making sure you have metrics in place so you can measure when you're improving. In this case maybe we can talk about a hypothetical example of a legal startup: maybe you want your chatbot or your agent to talk and sound like a lawyer. The first thing you can do is try what is known as role prompting, which is just telling the model to talk like a lawyer. From there I would re-evaluate, measure, see how you're doing, and if it's not where you want it to be, there are a couple more techniques you can try, especially with regard to domain knowledge and domain specificity. The next thing I would think about trying is a family of techniques known as domain adaptation. The first thing you might try there is continued pre-training: you take that language-modeling task where you start by predicting the next word, you use backpropagation to update your weights, but you focus it on a corpus of data that is relevant to your domain. In this legal example, you could do continued pre-training on a corpus of law textbooks. To give a human analogy, it's like telling a first-year law student to read 50 legal textbooks and then come back and talk to me more like a lawyer. Amazing. And what about things like classification, since we mentioned it earlier, or chat, where we're talking about adapting the task that the LLM performs instead of the domain?
That's a great question. If you're asking the model to make a decision about, I don't know, some kind of case law or something, continued pre-training can be of great help, but you may decide to focus on specific examples of the task you want it to perform. If it's a classification problem, I would train it using backpropagation on the next-word prediction task, but focused specifically on examples of that task; in that context this is generally known as supervised fine-tuning. So to summarize: for domain knowledge and domain specificity, continued pre-training is a good place to start; if you're looking to improve on a very specific problem framing, a very specific task, SFT is a good place to start.
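To make the supervised fine-tuning step a bit more concrete, here is a minimal sketch of what a tuning dataset for the legal-classification task might look like. The JSONL format, field names, and example labels are illustrative, not tied to any particular tuning API.

```python
import json

# Illustrative supervised fine-tuning examples: each record pairs a prompt with
# the exact output we want the tuned model to produce for that task.
examples = [
    {
        "input": "Classify the area of law: 'The landlord withheld the deposit without cause.'",
        "output": "landlord-tenant",
    },
    {
        "input": "Classify the area of law: 'The employer terminated her for filing a complaint.'",
        "output": "employment",
    },
]

# Write the examples out as JSONL, a common interchange format for tuning jobs.
with open("legal_sft.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```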
There are many other techniques to explore within that, but we can talk about those maybe later. Peter D, I feel like we've ignored you here. The examples that Peter G gave us were use cases where you can define quality and accuracy quite concretely: a legal chatbot that you can evaluate with the LSAT, or a classifier where you can measure accuracy and precision. What about use cases where quality is more ambiguously defined? Yeah, it's interesting. So I don't know if this is a West Coast thing, but we had a bunch of startups come to Google a few weeks ago and they're trying to solve this personal-companionship problem, and it looks like there's a lot of venture capital money in that. It's a West Coast thing, by the way, you guys on the East Coast.
Anyway, we did this little experiment, right? Let's see if we can tune a Gemini model to be like Sherlock Holmes or Elizabeth Bennet. So we ran this experiment where we tuned on Sherlock Holmes with about 10,000 examples, and there was this really strange phenomenon where this fine-tuned Sherlock Holmes didn't seem to know he was in a book, right? He would respond in the first person, which is a little strange. If you do the same thing with vanilla Gemini, you'll notice that Gemini talks like Gemini, essentially with a little Sherlock Holmes lipstick on. But one of the fun things was figuring out how to evaluate this fine-tuned Sherlock Holmes versus a vanilla one. Is it enough to ask, "Hello, Holmes,
do you live at 221B Baker Street?" It turns out that it's not, and especially when you talk about these AI-companion domains, I think there are a lot of these subtle questions, like: is this character likable? Does this character scratch my itch, for some definition of itch? For that kind of thing, maybe it turns out that you need a human in the loop to evaluate the tuned Holmes, in this case, versus the standard Holmes.
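For that kind of subjective quality, a human rater in the loop is often the honest answer. Here is a minimal side-by-side preference sketch; the `tuned_reply` and `base_reply` callables, and the whole workflow, are illustrative assumptions rather than any specific evaluation tool.

```python
import random

# Minimal sketch of a side-by-side human evaluation between a tuned model and a
# base model. `tuned_reply` and `base_reply` are hypothetical callables that
# return a string response for a prompt.
def side_by_side(prompts, tuned_reply, base_reply):
    wins = {"tuned": 0, "base": 0}
    for prompt in prompts:
        candidates = [("tuned", tuned_reply(prompt)), ("base", base_reply(prompt))]
        random.shuffle(candidates)  # hide which model produced which answer
        print(f"\nPrompt: {prompt}")
        for i, (_, text) in enumerate(candidates, start=1):
            print(f"  [{i}] {text}")
        choice = int(input("Which answer do you prefer (1 or 2)? ")) - 1
        wins[candidates[choice][0]] += 1
    return wins
```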
So you're telling me that tuning is a bit like method acting for LLMs? I think so, I think so. But the funny thing is, it turns out that you can fine-tune a model with between 400 and 500 examples, and in this Holmes case I think we used about 10,000 examples, and it could be that some of those examples were actually lower quality. Just to give you an example, I said, "Hey, Holmes, can you tell me a little about rugby?" and he said, "I can't tell you, I've never played rugby." He basically refused, and I was thinking, if we had trained this Holmes with fewer, higher-quality examples we might have gotten a better result. There's a funny thing where more data isn't necessarily better, right? I think that's a great point, and before we get to personalization, because there's a lot more to cover,
I want to step out of this bubble of tuning an LLM for a single task and maybe talk about how LLMs are being extended for more complex workflows where they operate asynchronously, synchronously, and even autonomously. Peter D, I know you have a lot of experience with this, if you could share your thoughts. So yeah, we did this interesting experiment a few weeks ago where we trained an LLM to be an asynchronous intraday trading robot, and just to prove that I had some skin in the game, I threw a thousand dollars at it. The funny thing is, I made about three dollars, so I don't know if I can retire yet, but it's a positive return, which is good. But the funny thing is that even out of the box, using this thing called function calling, the LLM will actually learn how to act as an autonomous agent. To do that, we had to do some classification tasks, like: for these tweets, these news headlines, are they bullish, are they bearish? And the funny thing was that every tweet was bearish, I'm not sure why, and it always wanted to spend at least half of my money. One of the things I was thinking is that if you did something like a backtesting algorithm and maybe trained the model with data from the last market year, you would get an even better result, but I was a little surprised by what you could do out of the box.
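A rough sketch of the function-calling loop for an agent like the trading robot described above follows. The `call_model` client, the tool names, and the response format are all hypothetical placeholders for whatever function-calling interface your model provider exposes, not a specific API.

```python
# Hypothetical sketch of a function-calling loop for a day-trading agent.
# `call_model(prompt, tools)` stands in for a model client that supports
# function calling and returns either plain text or a structured tool request.

def classify_headline(headline: str) -> str:
    """Tool the model can call: label a headline as 'bullish' or 'bearish' (stubbed)."""
    return "bearish"

def place_order(symbol: str, dollars: float) -> str:
    """Tool the model can call: place a simulated order (stubbed)."""
    return f"ordered ${dollars:.2f} of {symbol}"

TOOLS = {"classify_headline": classify_headline, "place_order": place_order}

def agent_step(call_model, headlines: list[str]) -> str:
    # Ask the model what to do, given today's headlines and the available tools.
    response = call_model(prompt="\n".join(headlines), tools=list(TOOLS))
    if response.get("tool") in TOOLS:
        # The model chose a tool; execute it with the arguments it supplied.
        return TOOLS[response["tool"]](**response.get("args", {}))
    return response.get("text", "")
```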
Great, well, I'm going to responsibly steer this conversation toward factuality, because I think it's very relevant; factuality is very important for companies. Recently, an airline's chatbot made up a fake refund policy, and they are now being sued as a result. Peter G, maybe you can describe for us what's happening under the hood when a chatbot hallucinates, and then we can discuss some approaches to dealing with it. Yeah, that's a great question. I think what's happening is exactly what we were talking about at the beginning of this talk: the model is simply trying to predict the next word. In some cases the model is very sure about the next word; if you imagine a probability distribution over all possible tokens, you'll get a really spiky distribution. In other cases it might be a lot less sure. So I think that's one dynamic at play.
The second dynamic that I think is really interesting is that in many cases these models are trained to be helpful. After the pre-training stage there is usually a phase known as instruction tuning, where that's exactly what you're doing: you're training the model, instructing it, teaching it how to be helpful and how to follow directions and give useful results. So in a case where the model is not sure, especially if you've instruction-tuned it, instead of just saying "I don't know" or "I'm not sure," the model might hallucinate something, make something up, to try to be helpful and answer your question. And what are some more advanced approaches people can take to deal with hallucinations?
Yes, there are a couple of things that we recommend. One is to use a technique you all might be familiar with known as retrieval-augmented generation. The idea is that you want to use language models and a database together to solve the problem: you let language models do what language models are good at, which is generate natural language, and you let databases do what databases are good at, which is store, update, and delete data. Then you train the language model to retrieve the relevant information from the database and give a response based on that context. In the airline example, hopefully you would have retrieved the actual refund policy and then manipulated it, summarized it, or used it to answer the question. Another hot topic is LLM guardrails; maybe you could touch on that briefly. Yes, 100%. This is a super important topic, and it's not important just for generative applications; it's important any time you're building a machine learning system. The idea is that you often take a stochastic machine learning model, which will always have a little bit of randomness, and apply a policy layer or a set of guardrails on top. In the case of Gemini, or sorry, in the case of the day-trading large language model, you could do something like: no matter how good the market looks, don't spend more than 10% of my money, or no matter how good the market looks, don't put all my money into GameStop. That limits the output space you're considering and controls the behavior a little.
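A minimal sketch of that kind of deterministic guardrail applied on top of whatever the model proposes might look like the following; the thresholds, symbols, and function names are illustrative.

```python
# Policy layer applied after the model proposes a trade: no matter how confident
# the model sounds, cap any single order and block specific symbols outright.
MAX_FRACTION_PER_TRADE = 0.10
BLOCKED_SYMBOLS = {"GME"}  # e.g. "don't put all my money into GameStop"

def apply_guardrails(proposed_symbol: str, proposed_dollars: float, balance: float) -> float:
    if proposed_symbol in BLOCKED_SYMBOLS:
        return 0.0  # refuse the trade entirely
    # Clamp the spend to at most 10% of the current balance.
    return min(proposed_dollars, MAX_FRACTION_PER_TRADE * balance)
```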
There's a really interesting case where you can also use LLMs to help evaluate that policy. You could have a layer on top that asks, "Is this in the company voice?" and gives some examples of the company voice, or "Is this a useful statement, is it a short and precise answer?" That could be another way to use LLMs as a policy layer. It's worth noting that factuality and these approaches, especially guardrails, can sometimes take a toll on the user experience. Peter D, given your experience working with startups, especially these highly creative AI companies, maybe you can share some insight into how this balance between safety and creativity is struck. Yeah, this is a really interesting phenomenon. I've noticed that when startups come in, one of the first things they do with the LLM is disable all the safety features, and that's because there's this tricky optimization problem between safety and utility. There are cases where, just to give you an example, someone wanted to do multimodal analysis on monuments and couldn't about 75% of the time because there was a human face in the image. It's one of these really subtle dances, because it's possible, for example, that without realizing it you adopt a toxic model, turn off the safety filters, and suddenly have an embarrassing moment with your customers. Anyway, there's a really subtle dance between safety and utility, and I think as a startup those are just some of the things you need to keep in mind when you go to market. Incredible. In the interest of time, I want to shift to data privacy, because this is absolutely key for businesses.
I mentioned those two main selection criteria for model providers; data privacy is actually number one. So maybe touch on what the basis of this concern is and what are some of the approaches that companies can take. There's a long history of data privacy and machine learning going hand in hand, and for as long as there have been machine learning models, people have had concerns about data privacy, and these may be well founded. Many of you may be familiar with the Netflix Prize challenge from about 10 or 15 years ago at this point: even with a relatively limited output space, just a rating or ranking problem over what people watch, you could reveal a lot of sensitive information about the people in the dataset. So the first advice I would give is to never train your model on sensitive data, whether it's a very simple classification model or a much more complicated generative model.
I think the reason people are thinking about it so much in the generative case is that the output space the model can produce is much larger: instead of true or false, yes or no, the model can generate free text. So, to motivate this concern a little, if I were to prompt a model with "Peter Grabowski's social security number is...", hopefully it wouldn't be able to produce a valid answer. Now let's say a company here has a product workflow that really depends on sensitive data. What are some approaches to enable working with that sensitive data without having to train on it?
Good question. One thing I would recommend is that retrieval-augmented generation framework we were talking about a moment ago. It allows you to store the sensitive data in a database where it can be appropriately access-controlled, and then at inference time, at the right moment, you can inject it into the model's context and let the model use it in its response. Awesome. One thing I would add here, for people who are concerned about data privacy regulations like GDPR and HIPAA: RAG is very complementary in the sense that, with a database, you can delete data permanently and easily; it's just a matter of deleting rows and tables.
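A minimal sketch of the retrieve-then-inject pattern just described follows. The `search_policies` database lookup and the `generate` model call are hypothetical placeholders; the point is that sensitive or frequently changing facts live in the database and only enter the prompt at inference time, never the training set.

```python
# Retrieval-augmented generation sketch: sensitive or frequently-changing facts
# live in a database and are injected into the prompt only at inference time.

def answer_with_rag(question: str, search_policies, generate) -> str:
    # 1. Retrieve the relevant records (e.g. the real refund policy) from the database.
    passages = search_policies(question, top_k=3)
    context = "\n".join(passages)
    # 2. Ask the model to answer using only that retrieved context.
    prompt = (
        "Answer the customer's question using only the policy text below. "
        "If the answer is not in the policy, say you don't know.\n\n"
        f"Policy:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```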
You can also localize that database to ensure data is not transferred outside a specific geographic region. Both are very important for things like GDPR. There's another lens on this, though, Peter G, that I want your take on: the distrust among companies, I think especially startups, of closed-source model vendors, because they're worried that logs of that sensitive data will be used to train the closed-source provider's models. No, that's absolutely right, and I think to that extent startups tend to use something like a Llama 2 stack, or maybe a Gemma stack, because they can run it on a couple of GPUs and they control everything from start to finish. What I have noticed, though, is that some startups tend to use long context windows as an ad hoc form of RAG, and what that means is that there's a promiscuous mix of inference data and possibly training data, and that becomes dangerous when we're talking about things like law and insurance. So I think just having RAG is a form of data discipline, and even if you're running your own open-source models, you can still have privacy issues if you're not careful. But I think that's also something we're trying to address: I know Vertex AI is basically making the argument that your data is safe with Google, right, and I think that's at least how we're trying to differentiate ourselves.
Great. Look, you two, my brilliant colleagues, there is absolutely no way we could cover all the necessary information on how to take an LLM to production in 25 minutes, but you've done a really good job, and I just want to thank everyone for listening. I want to thank Peter and Peter, and if you have any follow-up questions, or if you didn't understand something, we'll be in the media lab, so feel free to come up to us and ask. I hope you enjoy the rest of the program.
