OpenAI v2 Embeddings + Search + Context Enhanced Completions

Apr 02, 2024
So in this video we're going to cover a few things. First, it's all about OpenAI's new embedding model. If you're not familiar, OpenAI announced on December 15th an entirely new embedding model that greatly increases efficiency and reduces cost. It's fantastic, so I'll explain how to use the new model. I also want to give you a concrete example with real-world data, so for this I'm curating markdown documentation. Many of the big tech companies, Microsoft, Amazon, and GitHub to name a few, keep their documents, or at least some subset of their documents, in Markdown format, so it makes a really rich data set to use and play with. Then, once we've created the initial embeddings and used search and cosine similarity, we will provide the result of that search as context for a completion call. Passing that additional context to text-davinci-003 when someone asks a question essentially lets us get a result similar to what you would normally need a fine-tuned model for. If you have some body of knowledge with data that you know is outside of OpenAI's models, and you want to be able to ask questions about it and have the chatbot return very human-sounding responses from your own data, there are two options. One option is to fine-tune your model, which can be time-consuming and very expensive.
The other option is to basically do a search on your internal knowledge base and then provide the result as context at the time of the completion call, and we'll walk through that to make it clearer. I also want to give you an idea of working with small, medium, and what I'll call large data sets, because "big" in big tech is very relative; in this case, large means millions of tokens, and there are certainly much larger data sets than what I'm going to cover here. In the same vein, when you use these models you think about the initial cost, you think about how you scale, and you think about the rate limits that come into play depending on how much data you are sending and how fast you send it.
To start with, I have this little script that I wrote. The purpose behind it is that I have a folder with some markdown files, and I'll show you that folder: the markdown files are not just in the root, they're in subfolders, so this is a very basic script that, regardless of how many subfolders there are, keeps digging deeper until it reaches the end. Every time it finds a markdown file, it processes it a bit and adds it to a list I'm building, which then gets turned into a data frame. The way I'm doing it, each file essentially becomes a row in the data frame with two columns: one with the file name and another with the full text of the contents.
I'm doing a little pre-filtering here. If you haven't seen one before, some markdown files have a particular structure, and in this case we have metadata at the top. I don't want to include that metadata because of the token limit, so I'm searching for the delimiters and stripping that information out: I have a small search string I look for each time, plus a small counter to make sure I'm not capturing the metadata but am certainly still capturing the rest of the doc. If you were working with text-based files that are not Markdown, any file that is human-readable in Notepad, you could adapt the script to use it: if you have a bunch of text files or a bunch of HTML files, you can change the file extension, add additional file extensions, and then simply delete or modify the checks that detect and strip the metadata. We'll run this real quick; it runs pretty fast.
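For reference, here's a minimal sketch of what a crawler script like this might look like, assuming a local folder called docs, a text column, and a simple "---" front-matter check; the path, column names, and helper function are illustrative assumptions, not the exact script from the video:

```python
import os
import pandas as pd

def strip_front_matter(text: str) -> str:
    # Markdown front matter sits between a pair of "---" lines at the top of
    # the file; everything between them is metadata we don't want to embed.
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "\n".join(lines[i + 1:])
    return text

rows = []
for root, _dirs, files in os.walk("docs"):           # keeps digging into subfolders
    for name in files:
        if name.endswith(".md"):                      # only markdown files
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as f:
                body = strip_front_matter(f.read())
            rows.append({"filename": name, "text": body})

df = pd.DataFrame(rows)                               # one row per file
print(len(df), "articles")
```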
We'll generate a data frame based on that, and now we have 27 articles, indexed from zero, with the file name in this column and the full text of the doc over here. You can see we have some newlines and some other things we want to strip out, so I'm going to do some basic normalization; this is certainly not exhaustive. I'm also going to remove the pound sign, just because it's used in headers within Markdown. Ultimately you'll want to do a lot more cleanup than this, but this is just to give you an idea, so we'll run this cleanup. If you've used embeddings in the past, you've probably used the Hugging Face Transformers library for tokenization, which has its own GPT-2/GPT-3 tokenizers. With the new embedding models from OpenAI, you need to use the cl100k_base tokenizer.
This is part of a new library. I'm not sure about the pronunciation, whether tiktoken is "tick token" or "tik token"; we'll have to ask OpenAI. But if you do pip install tiktoken, you can import it and use it; it's pretty easy to set up. So we'll use it here, very similar to the previous call we used to make, only in this case we have to use this different library. If you've ever done embeddings before, make sure you use this tokenizer. We'll apply it to our data frame, and now we have this n_tokens column, which is the number of tokens based on the size of the content of the article itself.
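A quick sketch of that tokenization step, assuming the data frame from above; the column names are assumptions:

```python
import tiktoken

# The new embedding model uses the cl100k_base encoding rather than the
# GPT-2/GPT-3 tokenizers from the Hugging Face Transformers library.
encoding = tiktoken.get_encoding("cl100k_base")

df["n_tokens"] = df["text"].apply(lambda t: len(encoding.encode(t)))
print(df[["filename", "n_tokens"]].head())
```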
One thing I like to do is look at the price. Some of these prices are for the old V1 models, which we're not going to use, but I think it's useful right now, since people may have code running the old V1 models, to understand and appreciate how much cheaper it is to do things with the new Ada V2 model. So we'll run this, and all we're doing is taking the n_tokens column, summing it, and then doing some math based on the prices on the OpenAI site to make this nice table showing the approximate cost of things. You can see that even for this relatively small data set of 27 items, the difference is clear.
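Here's a hedged sketch of that cost math; the per-1,000-token prices below are assumptions based on OpenAI's pricing page at the time and may be out of date:

```python
# Sum the tokens for the whole data set, then multiply by each model's
# (assumed) price per 1,000 tokens to build a small comparison table.
total_tokens = df["n_tokens"].sum()

prices_per_1k = {
    "text-embedding-ada-002 (v2)": 0.0004,
    "ada (v1)": 0.0040,
    "babbage (v1)": 0.0050,
    "curie (v1)": 0.0200,
    "davinci (v1)": 0.2000,
}

for model, price in prices_per_1k.items():
    print(f"{model:30s} ~${total_tokens / 1000 * price:,.2f}")
```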
The DaVinci V1 model would have been $11, whereas the new Ada V2 model is cheaper than everything we had before, which is awesome, and the performance is generally as good as DaVinci for embeddings and embedding search. So here we're going to generate our embeddings: we're just going to take the text we have in our content column and convert it into a bunch of floating point numbers that are a representation of that data. We'll run this and, voila, there's our data frame. As you can see, there's the Ada V2 embedding column, so these are just big lists of floating point numbers representing this content.
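A minimal sketch of the embedding step, assuming the pre-1.0 openai Python package (openai.Embedding.create) and an API key already configured; the column names are assumptions:

```python
import openai

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list[float]:
    # Returns the embedding vector (a list of floats) for one piece of text.
    response = openai.Embedding.create(input=text, model=model)
    return response["data"][0]["embedding"]

# One request per article; see the rate-limit discussion later in the video
# before doing this on a large data frame.
df["ada_v2_embedding"] = df["text"].apply(get_embedding)
```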
Now we can do a search as well. This is kind of a standard search: for both embedding and searching we can now use text-embedding-ada-002, whereas before we had to use a different model for each, and we'll rank results by cosine similarity. We generate an embedding of whatever question is asked: here we just take the user input, so we'll create a little question box in the notebook, a question gets entered, it's passed in, the embedding is created, and then that embedding is compared by cosine similarity to all the embeddings we already have from the content of our documents. So let's run this; we'll return the top three results, and now we get this little question box. I'm going to ask a question that I know we couldn't answer any other way without using our content, because we'll get into that later.
So we'll type, how do we actually say it: does Azure OpenAI support customer managed keys, aka CMK? With a question mark, and since it's a text input field I just need to press Enter. You can see it came back with our top three results, ranked by cosine similarity, and this is our best bet: this article is about encryption at rest and customer managed keys, so that's exactly what we would have wanted. The second best choice was a news article that actually announces support for customer managed keys, so both would be pretty good options. If we want to specifically see the content of the first top result, which is essentially the entire body of the article, you can see it here as well. There are some things we'd like to filter out; we could certainly reduce this result to something smaller in the phase where we actually generate the embeddings, but for now we just have a one-to-one relationship where an article keeps all of its content in one row instead of being broken down into smaller, easier-to-manage chunks.
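The search step above might look roughly like this, reusing the get_embedding helper from the earlier sketch; cosine similarity is written out with NumPy here, although the openai.embeddings_utils module shipped a similar helper at the time:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_docs(df, query: str, top_n: int = 3):
    # Embed the question with the same model used for the documents,
    # then rank every article by cosine similarity to it.
    query_embedding = get_embedding(query)
    ranked = df.copy()
    ranked["similarity"] = ranked["ada_v2_embedding"].apply(
        lambda e: cosine_similarity(e, query_embedding)
    )
    return ranked.sort_values("similarity", ascending=False).head(top_n)

results = search_docs(df, input("Ask a question: "))
print(results[["filename", "similarity"]])
```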
What we're doing here is the same search we did before; we're still going to ask a question, but now we're going to add some other fun stuff. We take the top result we get back and pass it into a variable we call context. We're going to use text-davinci-003 to do a completion with OpenAI. We have a new initial message because we're doing a completion call now, the same kind of initial message you could use if you were just trying to make a chatbot with OpenAI, so it's very simple: the following is a conversation with an AI assistant; the assistant is helpful, creative, clever, and friendly. Then we create a combined message, which is our initial message, plus the content we got from our search, which in this case will be a complete article, plus "Q:" and then the question, whatever it was, and then an "A:" as well so the model knows to answer. This should be fine.
We'll write our question again, does Azure OpenAI support customer managed keys?, and press Enter. And you see now, instead of before, where we just got back that list ranked by cosine similarity, we get a real answer, because we provided that context. text-davinci-003 is able to answer a question that I know is not in its model, but it answers the way a model fine-tuned with the correct, real information would. You can see that in this case it says you can get more information by consulting this documentation page; we're not actually referencing that page and it isn't something we gave it, so there are definitely things to improve there, and including the link to the article we were using for context as part of the answer would certainly be another option. If you want to see the full combined message we're putting together from these pieces, we'll do a quick print, and you can see our initial message framing it as a conversation, then the context of that doc, and then at the end our "Q:" with the question attached.
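And a sketch of the context-enhanced completion call; the assistant preamble, max_tokens, and temperature here are assumptions, not the video's exact values:

```python
initial_message = (
    "The following is a conversation with an AI assistant. "
    "The assistant is helpful, creative, clever, and very friendly.\n\n"
)

question = input("Ask a question: ")
context = search_docs(df, question).iloc[0]["text"]   # top search result as context

combined_message = f"{initial_message}{context}\n\nQ: {question}\nA:"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=combined_message,
    max_tokens=300,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```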
What I want to do next is go through the same process again, but with a slightly larger data set. What we did before was 27 articles; now I want to process the full set of Cognitive Services docs, covering things like computer vision, form recognizer, and anomaly detection, plus the Azure OpenAI content, which is just a larger data set, I think about 1,800 markdown files. We'll see how the script handles it, how it calls the API, and how we have to think about things as we scale to larger and larger data sets. We'll run this like we did before; it simply takes all the markdown files it can find in this directory or any subdirectory, runs pretty fast, and creates a data frame. You can see we now have 1,842 items, with all their contents. Again, as before, we'll clean up some newlines and some other things, but ultimately in a real scenario you'd clean this data a lot more.
We'll use our new tokenizer as we did before. The other good thing about this model is that we get an increased maximum of input tokens, whereas before we were limited to 2,046. Run this. All right, now we have our n_tokens column. You can see that with a larger data set like this, pandas truncates the display and doesn't show us everything, so we don't have to scroll indefinitely, which is very nice; the rest of the rows are there between index 4 and 1,837. I'd like to know how much this is going to cost, because it's on my personal OpenAI subscription, nothing corporate, and again you start to really see the cost savings of the new Ada V2: for these 1,800-plus items it will cost me about a dollar, whereas DaVinci would have cost $500, and even Curie would have cost me fifty dollars.
I probably wouldn't have bothered to do this experiment at $50, at least not on my personal account. We'll check the data frame lengths real quick and remove any rows that go over 3,500 tokens, just to trim things down; technically that cost number will now be a little lower, because the total token count of the remaining items is lower, but this just gives me a baseline. I could certainly run the token count again, but I'm fine with that. Before, we just went ahead and generated our embeddings, but now that we have this larger data set we need to start thinking about rate limits.
If I were a free trial user, I would absolutely have to think about rate limits, because with the way we did things in our mini example we're not doing any batching: we're sending one request at a time, very quickly, because when you do an apply in pandas it runs fast, but it still goes one request at a time. It's not parallel processing or anything, but it still sends requests quickly and gets those answers back. So free trial users who tried to run the code we had last time would hit a rate limit, and pay-as-you-go accounts have their own limits to deal with.
For me, I just created this account on OpenAI in the last 48 hours, so I'm going to hit the rate limit of 60 requests per minute, because I'm not doing any kind of batching yet; even though I could technically use 250,000 tokens per minute, I'm only sending one request at a time. So this is basic code to handle that. If you want to do something like this and your data set isn't so big that you're getting into the territory where you need to batch and have a single request that actually contains multiple embeddings, you can fix it with this.
And certainly, once this 48-hour window passes, I would still be fine with this particular data set, because then the account goes up to 3,000 requests per minute, and since we're eventually only sending around 1,700 requests, it would be fine. This is not the most complicated code: we just have some counters, and we set our rate limit here. I have this IPython clear-display call because I didn't want to print and have the output scroll down forever, so it basically clears the screen every time through the loop, just so we can watch the numbers go up and see what's happening without having to deal with anything else.
This code here I'm going to comment out real quick, just because of a dumb mistake I ran into where what was happening wasn't intuitive to me, so I'm going to force it to produce the error. You can see that once we hit our rate limit, we sleep for 60 seconds so they don't cut us off, and then we continue from there; you can adjust this as needed. I'm going to run this, and it's going to fail because there's something in my data that I need to clean up. I'll fix that right now, but in case you see this error (when I searched for it, I didn't find any answers about what was happening, and it's not an obvious message): what it says is that blank is not valid under any of the given schemas, and I can see that it's partway through doing my embeddings. Ultimately what this means is that somewhere in here, in the rows pandas is nicely hiding from me, I must have some content fields that are blank, which means there's some problem with my input scripts or some exception I didn't think of; maybe there are files that have no metadata, so the search-and-strip step throws away the wrong thing. I need to go back and look, but essentially, for this to work, I can check whether my content field (in this case it's called text) is blank, and if it is, just write the word blank and print "blank content field detected" to the screen. Ultimately you'll want to clean your data, because the way I'm doing it means I'll be charged for a handful of embeddings for fields that weren't processed correctly.
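A minimal sketch of that rate-limited loop with the blank-field guard; the 60-requests-per-minute constant matches what's described above, but the counters, what gets stored for blank rows, and the progress display are assumptions:

```python
import time
from IPython.display import clear_output

REQUESTS_PER_MINUTE = 60      # new-account / free-trial request limit
embeddings = []
requests_this_minute = 0
total_requests = 0

for text in df["text"]:
    if not text.strip():
        # Blank content triggers the "not valid under any of the given
        # schemas" error, so flag it instead of sending it to the API.
        print("blank content field detected")
        embeddings.append(None)
        continue

    if requests_this_minute >= REQUESTS_PER_MINUTE:
        time.sleep(60)        # back off for a minute so we don't get cut off
        requests_this_minute = 0

    embeddings.append(get_embedding(text))
    requests_this_minute += 1
    total_requests += 1

    clear_output(wait=True)   # keep the notebook output to a single line
    print(f"{total_requests} / {len(df)} embeddings created")

df["ada_v2_embedding"] = embeddings
```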
I know there are a limited number of them, so I'm okay with that. I'll hit run, and you can see it detected some blank fields in there, which is where the script would have failed before. We are now sleeping for 60 seconds because we sent our initial 60 requests. I'll stay on this for a moment, but I'm not going to let it run to the end; you get the idea. This is just to show that as you start scaling to a somewhat larger data set, you need to think about how you're handling those embeddings. You can see that once we finish sleeping, we keep our total request count but our per-minute request counter resets, runs back up to 60, and then the script sleeps again. We don't need to watch all of this; essentially the rest is the same as what you saw before. This is just some very basic code for when you need to deal with rate limits but don't necessarily want to implement the more advanced solutions yet. Now we're going to go really big: in this case, what we have in temp storage is all the articles that exist in the Azure docs. Basically, one of Microsoft's most popular repositories, it seems, is azure-docs, so I have a copy of that locally that I'm going to go through and process, so we can get an idea of what it's like when we're playing with a much larger data set. We'll pick up each markdown file of each article.
I don't necessarily recommend forking the entire repository unless you're interested in contributing to it, because it's quite large, but we'll go through it quickly and process all those files. This one will take a little longer, but we'll see how long it takes our script to handle it. If you're wondering why we use line.find instead of one of the many more elaborate ways to do what I'm doing in Python, it's because find is implemented in C and therefore runs very fast. It didn't necessarily matter when we were running the mini version of this, or even the medium version where we only needed to go through about 1,800 files, but in this case we're going through a lot more files, so I need something that can really process all of them quickly. Okay, now we're complete, so not bad: 316 seconds to go through and process every Azure doc that Microsoft has, at least the ones in Markdown format.
Technically there are also YAML documents for FAQ articles and other things. We'll put them into a data frame, and now we have 23,878 rows, which means we have 23,878 items. We'll do some normalization on that, and we want to tokenize it like we did before. This time I'm going to see how long it takes, because I'm curious. Okay, that's not bad: 49 seconds to tokenize roughly 23,000 articles, which I think is pretty good. Let's see what it's going to cost, and this is where you can really see the benefit of the new model: about $21 for me to embed all of the Azure docs, but if I had been using the old DaVinci V1, it would be ten thousand dollars.
If I had been using Curie, it would have been a thousand, and honestly at any of those V1 prices, most of the hypothetical use cases I can think of don't really work for a data set of this size with embeddings, but at $21 there might be some interesting things we could do. I'm not going to go that far here; I mean, we could go crazy. Looking at the lengths, I'll drop anything over 5,000 tokens, so the count goes down a little, but that's still the total number of items. And if you look here, and let's paste this into Notepad because I need commas: 53 million tokens, which is a decent sum. So with my crude way of handling rate limits, we won't get through 53 million tokens any time soon, even after the 48-hour window passes and the account reaches 3,000 requests per minute, which is the maximum this simple approach gives us. What you need to do instead is think about your 250,000 tokens per minute, which you can take advantage of by batching, so that a single request actually contains multiple embedding inputs. There are some OpenAI articles in their cookbook on how to deal with rate limits, which give you some ideas on how to start handling this.
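A sketch of what that batching could look like; text-embedding-ada-002 accepts a list of inputs in a single request, and the batch size here is an arbitrary assumption you would tune against the tokens-per-minute limit:

```python
def embed_batch(texts: list[str], model: str = "text-embedding-ada-002") -> list[list[float]]:
    # One API request embeds a whole list of inputs, which is how you stay
    # under the requests-per-minute cap while using the tokens-per-minute budget.
    response = openai.Embedding.create(input=texts, model=model)
    # Results carry an index, so sort them back into input order.
    return [item["embedding"] for item in sorted(response["data"], key=lambda d: d["index"])]

BATCH_SIZE = 100              # assumption: tune so each batch stays under the limits
all_embeddings = []
for start in range(0, len(df), BATCH_SIZE):
    batch = df["text"].iloc[start:start + BATCH_SIZE].tolist()
    all_embeddings.extend(embed_batch(batch))
```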
There's also another thing to think about when you have a data set this large, or certainly a much larger one: you want to be able to retry, and you want to be able to do exponential backoff. With the current incarnation of the script, even on a much smaller data set, if for some reason something goes wrong with the API, we're going to lose everything. So what you really want is retry logic, and as each embedding comes back you should persist it somewhere temporary. The way this works with pandas' apply feature, we're not actually writing to the column until the very end, so if it has generated embeddings for 90 percent of the rows and then hits a record that errors out, you'll lose the embeddings for everything you've already generated unless you've built in some kind of retry logic. That's definitely something to do when you start scaling: look at how to handle rate limits. The cookbook also covers how to handle parallel processing, so if you're in the realm of millions of API requests, there are some ideas on how to handle that too.
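And a sketch of retry-with-backoff plus incremental checkpointing, so a failure partway through a large run doesn't throw away the embeddings you've already paid for; the file name, retry counts, and delays are assumptions:

```python
import json
import random
import time

def embed_with_backoff(text: str, max_retries: int = 6) -> list[float]:
    # Retry with exponential backoff (plus a little jitter) so a single
    # rate-limit or transient error doesn't kill the whole run.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return get_embedding(text)
        except Exception:     # ideally catch openai's rate-limit error specifically
            if attempt == max_retries - 1:
                raise
            time.sleep(delay + random.random())
            delay *= 2

# Append each result to disk as it comes back, instead of waiting for a
# pandas apply to finish, so partial progress survives a crash.
with open("embeddings.jsonl", "a", encoding="utf-8") as out:
    for i, text in enumerate(df["text"]):
        vector = embed_with_backoff(text)
        out.write(json.dumps({"index": int(i), "embedding": vector}) + "\n")
```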
That's actually what I wanted to do, so to recap: we started with our miniature, kind of a toy model, showing how this works with some real-world data. We went to a medium-sized model that shows roughly the limit of what you can do with very basic rate-limit handling inside our script. And then it shows that once you get to a large data set, once you're in the neighborhood of 53 million tokens, you really need to get to the point where you're doing batch processing, where each request has multiple embedding inputs buried in it, as long as you stay below the 250,000 tokens-per-minute limit. Without a doubt, if this video is useful, please like it; it helps me get an idea of whether it's worth investing time in. This is something I do in my free time, so thank you very much for staying, and I hope we'll make more videos soon. Thank you.
