
Live Coding A Machine Learning Model from Scratch (Google I/O'19)

Jun 06, 2021
Hello everyone, welcome to Live Coding a Machine Learning Model from Scratch. My name is Sara Robinson. I'm a developer advocate on the Google Cloud Platform team focused on machine learning. You can find me on Twitter at @SRobTweets, and more recently you can find my blog at sararobinson.dev.

Let's dive into what we'll cover today. I'll start with a quick overview of what machine learning is, then talk about the model we'll build, and finally get to the live coding.

So at a high level, what is machine learning? I really like this definition: it uses data to answer questions.
The idea here is that as we feed more and more data to our machine learning systems, they're able to improve and generalize to examples they haven't seen before. We can think of almost any supervised learning problem this way: we have our labeled training inputs, we feed them into our model, and our model generates a prediction. Now, these training inputs really could be anything. It could be the text of a movie review, and our model could be doing sentiment analysis to tell us it's a positive review. It could be numerical or categorical fitness data, and maybe our model is predicting the quality of sleep we'll get. It could be image data, so in this example our model predicts that it's an image of a cat.

This model concept may seem really magical, but it's actually not magic. At heart, everything boils down to matrix multiplication. If any of you remember y = mx + b from your algebra classes, this may seem familiar. The idea here is that you have your features as matrices (those are your inputs), you have what you're trying to predict, and then you have these weight and bias matrices, which are initialized with random values. As you train your model, you find the optimal values for these weight and bias matrices to get high-accuracy predictions.
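To make the y = mx + b idea concrete, here is a minimal gradient-descent loop in plain Python. The data points, starting values, and learning rate are made up for illustration; this is a sketch of "finding the optimal weight and bias", not code from the talk:

```python
# Fit y = m*x + b to points that lie exactly on y = 2x + 1,
# starting from arbitrary initial values, and let training
# find the optimal weight m and bias b.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

m, b = 0.5, 0.0          # "randomly" initialized weight and bias
learning_rate = 0.05

for _ in range(2000):
    # Gradients of the mean squared error with respect to m and b.
    grad_m = sum(2 * (m * x + b - y) * x for x, y in points) / len(points)
    grad_b = sum(2 * (m * x + b - y) for x, y in points) / len(points)
    m -= learning_rate * grad_m   # each step nudges the weight
    b -= learning_rate * grad_b   # and bias toward the optimum

print(round(m, 2), round(b, 2))
```

After training, m and b land very close to the true values 2 and 1, which is exactly the "optimal weight and bias matrices" idea in the one-variable case.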

That sounds great, but you might be thinking that none of the examples I showed above had matrices anywhere. Well, it turns out you can represent pretty much any type of data as a matrix. Take this image as an example: all an image really is is a bunch of pixels, and since this is a color image, each pixel has a red, green, and blue value (an RGB value). So this image becomes three matrices of the RGB values for each pixel.

Now let's say you have categorical data: a particular column in your data set describes the industry the person works in, and there are three possible values. The way you encode categorical data is through what's called one-hot encoding. To do this, we create an array with a number of elements corresponding to the number of categories, in this case three. All the values are zero except for one, and the index of that one corresponds to the value of that category. You might be thinking: wouldn't it be easier if we numerically encoded these, healthcare is one, finance is two, retail is three? We could do that, but then our model would interpret them as continuous numerical values, so it would assign a greater weight to retail.
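As a quick sketch, the one-hot encoding for the three-industry example can be written in a few lines of plain Python (the helper function is mine, not from the talk):

```python
categories = ["healthcare", "finance", "retail"]

def one_hot(value, categories):
    """Return an array of zeros with a 1 at the index of the given category."""
    encoding = [0] * len(categories)
    encoding[categories.index(value)] = 1
    return encoding

print(one_hot("finance", categories))  # → [0, 1, 0]
```

Each category gets its own slot, so no category looks numerically "bigger" than another.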
Our model would say that retail is greater than finance, which is greater than healthcare, and we want it to treat all of these inputs equally, which is why we one-hot encode them.

So I've shown you how we can transform our data for our machine learning models. In this talk, I want to share with you some Google Cloud tools that can help you build ML models, and I like to visualize this as a pyramid where you can choose your level of abstraction. If you want to delve into the details of matrix multiplication and build all the layers of your model from scratch, which is what we'll do today,
you can do that, but you also don't need ML experience to get started. The tools at the bottom of the pyramid are aimed more at data scientists and ML engineers, and as we move toward the top, they're aimed at application developers who may not have machine learning experience. In this talk we'll create a model with TensorFlow and deploy it on Cloud AI Platform. The idea here is to turn everyone into data scientists and ML engineers.

When I was thinking about the type of model I wanted to build for this talk, I wanted to choose something that would resonate with developers, and as a developer I can think of one tool that I use on a daily basis: Stack Overflow. So I wanted to create a text classification model to see if we could predict the tags of a Stack Overflow question.
To do this we need a lot of data, and fortunately there's a public dataset available in Google BigQuery, which is our big data analytics warehouse on Google Cloud Platform. BigQuery has a lot of really cool public datasets for you to explore and play with, and it turns out it has one of Stack Overflow questions. It doesn't just have a few Stack Overflow questions: it has over 26 gigabytes and over 17 million rows of questions, so this is a great place to start. Since we don't have much time, I wanted to simplify the problem space, so in this example we'll only classify questions that have one of five tags related to data science and machine learning. For this particular question, our model should categorize it as pandas, which is a Python library
for data science.

Our first step is to get the data out of BigQuery. BigQuery has a great web UI where we can write SQL directly in the browser, get our data, and then download it as a CSV, which is what I did. I'm extracting the title and the question body and concatenating them into one field, getting a string of comma-separated tags, and then keeping only the questions with these five tags. So I ran this query, and as I was looking at the results I noticed something: a lot of question posters will conveniently put the name of the framework they're using in the question, which is very useful, and it made me think: do we really need machine learning for this?
Could we just replace our entire ML system with an if statement: if "tensorflow" is in the question, tag it TensorFlow? The answer is no, because although a lot of question posters do this, there are a lot of people who ask really good questions but just dive right into the code and may never mention the framework by name. We want to capture those questions as well, and we don't want our model to just latch onto the obvious signal words, identifying only questions with the word "tensorflow" in them as TensorFlow questions. We want it to generalize and find patterns across these labels. So when I was preprocessing the training data, I wanted to remove these obvious keywords, and I thought I could replace them with a common word. Since everyone loves avocados, I replaced all of these framework names, including their abbreviations, with the word "avocado". The results in BigQuery look like this: we have questions about avocado predictive models, avocado datasets, lots of avocados everywhere.

What I haven't talked about yet is how we encode free-form text data into matrices.
There are a couple of different approaches to doing this. I'm going to use one called bag of words, which is really simple to get started with. You can think of each input to a bag-of-words model like a bag of Scrabble tiles, where instead of a letter on each tile you have a word on each tile. This type of model cannot detect the order of words in a sentence, but it can detect the presence or absence of certain words. To show you how this works, I want to walk through a really simple example. For this example we'll further limit our problem space and say we're just tagging questions with three types of tags. Bag-of-words models have a concept called a vocabulary, so imagine for a moment that you're learning English for the first time and you only know these ten words.
This is how our model will see the problem, which could lead to some interesting conversations when we only know these ten words. When we take this question, "How to plot a dataframe bar chart", we'll look at our vocabulary and say: okay, I recognize these three words. The rest of the words in the question will be gibberish to the model. When we feed our questions into our model, we want to enter them as arrays of the same size, so each question becomes an array the size of our vocabulary, with ones and zeros indicating which words from our vocabulary are present. Because "dataframe" is at the first index of our vocabulary, the first element of our array becomes 1, even though "dataframe" is not the first word in our question.
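The encoding just described can be sketched in plain Python. The ten-word vocabulary here is made up (the talk's slide vocabulary isn't fully shown), but the mechanics are the same:

```python
# A made-up ten-word vocabulary, with "dataframe" at index 0.
vocab = ["dataframe", "plot", "chart", "loop", "string",
         "list", "index", "column", "error", "value"]

def bag_of_words(question, vocab):
    """1 if the vocabulary word appears anywhere in the question, else 0.
    Word order in the question is lost, as described above."""
    words = set(question.lower().split())
    return [1 if word in words else 0 for word in vocab]

print(bag_of_words("How to plot a dataframe bar chart", vocab))
# → [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Only the presence of "dataframe", "plot", and "chart" registers; "how", "to", "a", and "bar" are outside the vocabulary and become gibberish to the model.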
The same goes for "plot" and "chart". So to summarize, our question becomes a vocabulary-sized array of ones and zeros; this is called a multi-hot encoding. Our prediction will also be a multi-hot array, since for this particular model our model will be able to identify a question that has multiple tags, not just one.

So now we know how to encode our text data for our model. Our model actually looks like this: the input is that vocabulary-sized bag-of-words array, and then we feed it into what are called hidden layers. It's going to be a deep neural network, which means we have layers in between our input layer and our output layer, so we take this vocabulary-sized array and change it to whatever size we choose for our second layer and our third layer.
Now, the output of these hidden layers doesn't mean much to us; our model uses it to represent complex relationships. What we really care about is the output of our final layer. There are many options for how to calculate the output of this layer. We're going to choose one called sigmoid, and what it does is return a value between 0 and 1 for each tag, corresponding to the probability that the tag is associated with the question. For this particular example, it seems the question has a high probability of being about Keras or TensorFlow.

So what are all the tools we're going to use to build this? I already showed you how we use BigQuery to collect our data and download the CSV.
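As an aside before we get to the tooling: the sigmoid output just described can be sketched in a few lines. Unlike softmax, sigmoid squashes each tag's raw score into a 0-to-1 probability independently, so the five values need not sum to 1 and several tags can be likely at once. The raw scores here are invented for illustration:

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Hypothetical raw outputs of the final layer, one per tag.
raw_scores = [2.2, -1.0, 3.0, -2.5, 0.1]
probs = [round(sigmoid(s), 2) for s in raw_scores]
print(probs)  # → [0.9, 0.27, 0.95, 0.08, 0.52]
```

Two tags (the first and third) come out with high probability, which is how one question can be tagged both keras and tensorflow.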
We'll use three open-source frameworks to do some preprocessing and transformations on our data to get it into the correct format: pandas and scikit-learn, and then we'll use TensorFlow to build our model, specifically tf.keras. We'll run our training and evaluation in Colab. Colab is a Python notebook hosted in the cloud that you can run in the browser, and it's totally free for anyone to use. Finally, we'll deploy our model on Cloud AI Platform.

The title says live coding, so let's move on to the demo. Can we switch to the demo? Great. Here we have our Colab notebook; it's connected. Again, anyone can access Colab in the browser at colab.research.google.com, and as you can see we don't have any code here yet (what could go wrong?), just a bunch of comments. So I'm going to start writing some code and running it cell by cell. Colab has this handy snippets tool, so I saved a couple of snippets for this notebook that I'm going to put here. Okay, I had to reset the runtime, and now we're connected. Our first cell just imports all the libraries we're going to use: TensorFlow, pandas, a couple of scikit-learn utility functions, and then Keras to build our model. So I've got all of our imports; let me make it a little bigger so everyone can see. The next thing we want to do is authenticate, and Colab also has a handy authentication function that we can run.
What this does is pop up a URL for us to authenticate to our cloud account, so I'll allow access, copy this code, and paste it. Now we're authenticated, and we can move on to the fun stuff. First we want to download our CSV. I've saved all this data to a CSV in Google Cloud Storage, which is our object storage tool on Google Cloud Platform, so I'm going to download the CSV to my local Colab instance. Then I'll use pandas to read this CSV, so let me read it here. This will transform our data into what's called a pandas DataFrame, which you'll see in a moment.
Next, we're going to shuffle our data. This is a really important concept in machine learning: in case your data was in some kind of order before, you want to make sure you shuffle it, and I'm using scikit-learn's shuffle function to do that. Pandas lets us preview our data, and this is what it looks like: we have our tags as comma-separated strings, and we have our question text with a lot of avocados. We can't feed this to our model in its current form, so we need to do some encoding.

First we'll deal with the labels. We want to encode these labels, as we saw on the slides, as multi-hot arrays of five elements, because we have five possible labels. What I'm doing in the first line is splitting each tag string into an array of strings, and then I'm using the MultiLabelBinarizer utility from scikit-learn, which takes all of those arrays of strings and transforms them into multi-hot arrays. For the first question, we can see that it's about tensorflow and keras, and here is the reference array of classes that scikit-learn created for us, so our label becomes this. Now we have encoded our labels, and we're almost ready to move on to the questions. Before we do that, we need to split our data. Another important concept in machine learning is the train/test split: we take most of our data, in this case 80%, to train our model, and we reserve a smaller portion of our data.
That portion is used for testing, so we can see how our model performs on data it has never seen before. In this case we have about 150,000 questions in our training set and 37,000 in our test set. Now we'll split our labels into training and test sets using the encoded labels, and run that. Our labels are ready to go, so the next thing we want to do is encode our question data into bag-of-words arrays. I wrote a class to do this, which I'll paste here. It seems like there's a lot going on, but I'll explain.
What's happening here is that we're using the Keras tokenizer utility. Luckily we don't have to hand-write all the code to convert our free-form text to bags of words; Keras has a utility that does it for us. All we do is pass it a vocabulary size. In the super simple example I showed before, we had a vocabulary size of ten, meaning we take the top ten words from our data set. Since this data set is much bigger, I chose a vocabulary size of 400, which you'll see in a moment. You want to choose something that isn't so small that it only captures the words common to all of your questions, but you also don't want something so big that your bag-of-words arrays become all ones. So we'll run this, and now we'll use that class to transform our questions in the training and test sets, so I'll do that right now.
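What the tokenizer does under the hood can be approximated in plain Python: count word frequencies, keep the top N words as the vocabulary (400 in the talk, 5 in this toy sketch), and multi-hot encode each question against it. This is a simplified stand-in, not the actual Keras implementation:

```python
from collections import Counter

questions = [
    "how to plot a pandas dataframe",
    "pandas dataframe groupby error",
    "plot a bar chart with matplotlib",
]

VOCAB_SIZE = 5  # the talk uses 400; tiny here for illustration
counts = Counter(word for q in questions for word in q.split())
vocab = [word for word, _ in counts.most_common(VOCAB_SIZE)]

def to_bag_of_words(question):
    words = set(question.split())
    return [1 if w in words else 0 for w in vocab]

matrix = [to_bag_of_words(q) for q in questions]
print(vocab)
print(matrix)
```

Every question becomes a fixed-size row of ones and zeros, regardless of how long the original text was, which is exactly the shape the model's input layer expects.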
These are still strings that we haven't encoded yet. Okay, the next step is to instantiate a vocabulary size variable, which is going to be 400, and now we're going to actually tokenize that text. We'll create a variable called processor and instantiate the text preprocessor class, passing it our vocabulary size. Now we can call the processor's create tokenizer method and pass it our training questions; that's this method here. Finally, the last part is actually creating those bag-of-words arrays, so this is going to be equal to the processor's transform text method (that one right there), and we'll pass it our training questions; the test set will be the same, with our test questions, as defined here. I'm going to run that. It will take a little time, because it's transforming all 180,000 text questions into bags of words, so while it runs we can keep writing the other cells, and they'll run when this completes. What I want to do is print the length of the first instance, which should be 400, and then just log that bag-of-words array so you can see it. Now let's start building our model; we'll come back to this when it finishes running. The next thing we want to do is save the tokenizer we created, because we'll need it when we deploy our model.
We're using a Python utility called pickle to save that tokenizer object to a file. So now it's time to create our model. We have transformed our data and saved our tokenizer, and now we actually want to write our model code using Keras. I'm going to wrap this in a method called create model, which takes our vocabulary size and our number of tags, and we're going to use the Keras Sequential model API that I imported earlier. This is my favorite API for building models because it essentially lets you define your model as a stack of layers, so the code we're about to write will correspond very well with the model diagram I just showed. It looks like that cell finished running, and we can see what the bag-of-words input looks like for our first question, made up of ones and zeros. Let's continue building our model. The first layer is a dense, fully connected layer, which just means that each neuron in the input layer is connected to each neuron in the output layer. We need to tell our first layer what the input shape will be; in this case it's our vocabulary size, which is 400. Finally, we need to tell it how to calculate the output of this layer. The nice thing about Keras is that we don't need to know exactly how this activation function works; we just need to know which one is the right one to use, and in this case ReLU is what we need.
We'll add one more hidden layer, also with the ReLU activation function. We don't need to specify the input shape here because it's inferred from the previous layer. So far, these are our hidden layers, and we don't care too much about their output. What we do care about is the output of our final layer. That layer will have five neurons, so its output is a five-element array, since we have five labels in our data set, and the activation function here is going to be sigmoid. That's really all the code for our model: in just four lines of code we've defined our model.
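A sketch of the layer stack just described, assuming tf.keras. The vocabulary size (400) and output size (5) come from the talk; the hidden-layer widths of 50 and 25 are my own placeholder choices, since the exact widths aren't stated here:

```python
import tensorflow as tf

VOCAB_SIZE = 400  # bag-of-words input size from the talk
NUM_TAGS = 5      # the five Stack Overflow tags

# A stack of layers mirroring the diagram: bag-of-words in,
# two ReLU hidden layers, sigmoid probabilities out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(VOCAB_SIZE,)),
    tf.keras.layers.Dense(50, activation="relu"),   # hidden layer (width is a guess)
    tf.keras.layers.Dense(25, activation="relu"),   # second hidden layer
    tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid"),  # one probability per tag
])
model.summary()
```

The sigmoid output layer is what makes this a multi-label model: each of the five outputs is an independent probability, so a question can score high on several tags at once.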
We need to compile the model so we can train and evaluate it, and to do this we need to tell it a couple of things. First, we need to tell it our loss function. This is how Keras will calculate our model's error: every time it runs a training step, it uses this function to measure the error between what the model predicted and the ground truth, what it should have predicted. Again, we don't need to know exactly how this particular loss function works under the hood; it's simply the best one to use for this type of model.
I also need to tell it my optimizer, which is how the model will update its weights after going through a batch of data, and finally I want to tell it how to evaluate my model as it trains; we'll use accuracy as the metric here. Then I return my model. Now we'll actually create our model: we call create model, pass in our vocabulary size and our number of labels, and then we can use the model summary method to see what our model looks like layer by layer, so let's run it.
We have our model ready to go, but we haven't trained it yet, so the next thing we'll do is train it, with a method called model fit. We pass it a couple of things: our bag-of-words matrix (that's our input, our features) and our labels (the ones we encoded). We need to tell it how many epochs to run the training for; this is the number of times our model will iterate over our entire data set, so we'll go through our whole data set three times.
The batch size is how many elements our model will look at at once before updating the weights; in this case we'll use 128. Then I'm going to pass an optional parameter called validation split, which holds out 10% of our training data and evaluates on it while our model is training. We'll run the training, and it should train pretty fast. What we ideally see is that our validation loss and our training loss are both decreasing, and that's what's happening, which is a good sign. If your validation loss increases while your training loss decreases, it can be a sign that your model is learning the training data too closely. We can see that our accuracy is 96%, which is pretty good.

The next step is to evaluate our model on the 20% of questions we held out, to see how it performs on data it has never seen. We call model evaluate and pass it our test bag-of-words matrix, our test labels, and our batch size. Our evaluation accuracy is very close to our training accuracy, which is a good sign. The last thing we want to do is save our model to a file; Keras uses the HDF5 (.h5) file format, so we'll save our model and then test it locally. I have a custom prediction class I wrote, and all it does is preprocess our data: it creates an instance of our model from the saved file, instantiates our tokenizer, takes the question as text, transforms it, and then returns a prediction, which is an array of sigmoid probabilities. Let's save some questions to predict on (this is just an array of two sample questions), and then we'll run a prediction with our local model. Great: for the first question it predicted keras, which was accurate, and for the second it predicted pandas, so we have our model running locally.
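A quick back-of-the-envelope check on the training loop above: the epoch and batch-size numbers imply a concrete number of weight updates. Using the talk's figures (roughly 150,000 training questions, a 10% validation split, batch size 128, 3 epochs):

```python
import math

train_examples = 150_000
validation_split = 0.10
batch_size = 128
epochs = 3

# Keras carves off the validation split before batching.
effective_train = int(train_examples * (1 - validation_split))
steps_per_epoch = math.ceil(effective_train / batch_size)
total_weight_updates = steps_per_epoch * epochs

print(effective_train, steps_per_epoch, total_weight_updates)
```

So the model's weights get nudged a few thousand times in total, once per 128-question batch, which is why the training finishes quickly.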
The accuracy is 96 percent. The next thing we want to do is package it and deploy it on AI Platform, so I wrote some code to do this. One of the features we'll use that is new on Cloud AI Platform is custom code, which lets us write custom server-side Python code that runs at prediction time. This is really useful because it allows us to keep our client super simple: the client just passes the text to our model, and we don't have to do any transformations on the client. We pass our text to the model, and on the server we transform that text into a bag of words and then return the correct tags. To do this, we need to copy some files to Google Cloud Storage so that Cloud AI Platform can find our model and our tokenizer.
I'm going to use the gcloud CLI to set the current project that I set up for this demo, and eventually we want to deploy this to AI Platform, so here's the deploy command we're going to run. We'll set our minimum nodes to one so our model doesn't scale to zero, and I'll create a new version called io19 and deploy it. That will take a couple of minutes to run, but if I look here in the Cloud AI Platform model UI, we can see that my model is deploying, which is pretty cool. While that's deploying, let's go back to the slides. So far, things are looking pretty good.
We've written some code to preprocess our data, and we've trained a model with fairly high accuracy, but can we do better? At this point our model is practically a black box: we don't know how it makes predictions. We know the predictions are accurate, but we don't know how it comes to those conclusions. As Sundar mentioned in the keynote, we want to make sure our model uses data to make predictions in a way that is fair to all users of our model. To do this, we need to open up the black box, and as model builders
we need to remember that we are responsible for the predictions generated by our model. This may seem obvious, but we should take steps to avoid bias by ensuring that our training data is representative of the people who use our model. Luckily, there are many great tools to help you do this. I'm going to look at an open-source tool called SHAP, and what SHAP allows us to do is interpret the output of any machine learning model. We'll use it here for our tf.keras model, but we could use it for any type of model.
The way SHAP works is that we create what's called an explainer object, passing in our model and a subset of our training data, and then SHAP returns attribution values. So what are attribution values? They are positive and negative values that indicate how much a particular feature affected our model's prediction. Say we had a model with three features: we'd get a three-element array of attribution values, where a high positive value means that feature pushed our prediction up, and a negative value means it pushed our prediction down. How will this work for our bag-of-words model?
What I set up for SHAP was to treat that 400-element vocabulary array as 400 features, so we get 400 different attribution scores, and we can take the highest and lowest of those to see which words had the biggest impact on our model. I wrote a little code to do this. I don't know if the colors display well there, but what this shows is the five highest and five lowest attribution scores, so I can see which words our model is using to make predictions. This particular question is about pandas, and the model picked up words like "df", "series", "column", and "dataframe", which is good because those are all words that are specific to pandas. The negatively contributing words were "at", "least", "for", "each", and "you", which are pretty common words you could find in any Stack Overflow question, so it's a good sign that our model is working correctly. So let's go back to the demo and take a look at SHAP. It looks like our model has finished deploying.
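As an aside, the "five highest and lowest attribution scores" lookup just described can be sketched in plain Python. The words and attribution values below are invented for illustration, not real SHAP output:

```python
# Hypothetical (word, attribution) pairs for one prediction: positive values
# pushed the predicted tag up, negative values pushed it down.
attributions = {
    "dataframe": 0.42, "column": 0.31, "series": 0.18, "df": 0.25,
    "the": -0.05, "for": -0.08, "you": -0.11, "plot": 0.02, "and": -0.03,
}

# Rank words from most positive to most negative attribution.
ranked = sorted(attributions, key=attributions.get, reverse=True)
top_words = ranked[:3]      # strongest positive signals
bottom_words = ranked[-3:]  # strongest negative signals

print(top_words, bottom_words)
```

Seeing library-specific words at the top and filler words at the bottom is the sanity check being described: the model is keying on meaningful vocabulary, not noise.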
I'm going to go ahead and set it as the default version, and before we get into SHAP we'll generate some predictions with our deployed model. I'll write a file with some test instances, and now we'll use the gcloud CLI to get some predictions from our trained model. We call gcloud ai-platform predict, pass it the name of our model, our Stack Overflow model, and pass it our text instances, which is this predictions.txt file we just saved. We'll run it and print the predictions from our deployed model, and it looks good.
We're calling our model deployed on AI Platform, and it has successfully predicted keras for the first question and pandas for the second. Now let's use SHAP to explain our model. SHAP doesn't come preinstalled in Colab, so we'll pip install it, and we'll also install this module called colored, which I use to color different words in the text. Once that's done installing, we'll import SHAP and create our explainer. We're going to create what's called a DeepExplainer; SHAP has a lot of different types of explainers, and we'll use the DeepExplainer, passing it a subset of our training data. Then we'll create our array of attribution values: we call shap values on the explainer and pass it a subset of our test data, in this case just the first 25 examples from our test set. So we have our attribution values, and the next thing we want to do is print them in a nice way. First we'll take that tokenizer we created, which is a dictionary, and convert it to a list so we can map our SHAP attribution values to the words they correspond to. Here we can see the most frequently occurring words in our model's vocabulary. The first thing we want to do is print a summary plot, which is a method SHAP gives us to see a summary of the words that have the most impact on our model's predictions. We'll pass it our SHAP values, and we also need to pass it the names of our features, which is the list of words we just created, along with the names of our classes.
Those class names come from our tag encoder. If I typed that correctly, we should get this nice chart, and what it tells us is the highest-magnitude words our model is looking at. It looks like "dataframe" was the word our model leaned on most in making predictions, probably because we had the most pandas questions in our data set, and we can see that "dataframe" was most impactful for pandas; it was also a signal word for the other libraries, probably a negative one. Then we can see the other impactful words here: "plot" was most important in matplotlib predictions, which makes sense, and we can see all of these other words in the list, which gives us a nice aggregate view of how our model is making predictions for this data set. Now, what if we want to highlight individual words?
I've written some code to do this, so we can see for a particular question which words signal the predictions. Here's an example of a pandas question, with words like "column", "pandas", and "dataframe" highlighted, while common words like "use" and "do" are not being used by our model. For the TensorFlow predictions, the model is keying on words like "session", "sess", "ops", and "tensor", and here's a Keras example where the most impactful words are things like "lstm", "layers", and "dense". So now we can see how we can use SHAP to make sure our model behaves correctly.
One way you might use this in a fairness context: if you have a model doing sentiment analysis, you want to make sure it's using the right features to predict positive and negative sentiment in whatever type of text you're working with. If we could go back to the slides: to wrap up, we'll put all of this together and get predictions from our model in a web application. It would be really helpful if we could see, while writing a Stack Overflow question, which tags are associated with it. With the help of my teammate Jennifer, who you should follow on Twitter, we created a Chrome extension.
What the extension does is take the HTML of our Stack Overflow question and pass it to a cloud function I wrote using the Cloud Functions Python runtime. The function uses the AI Platform API to get a prediction from our model, and then it returns only the names of the high-confidence tags to our Chrome extension. So let's see it in action; let's go back to the demo. Here we go. Okay, here's an example question about matplotlib and pandas. I just loaded my Chrome extension locally. It's not working right now; wait a second. Live demos, you never know. Let's try it one more time. Okay, something's going on here.
I'll try the question draft. Okay, it doesn't work, but luckily I have a video of what it was supposed to do. Sorry everyone, I always have a backup, so this is what it was supposed to do; with live demos you never know. It was supposed to predict matplotlib and pandas for this question, and then we have a question I wrote based on my own code, and it predicted keras for that. Sorry the demo didn't work. Let's try it one more time. Yeah, okay, it looks like it's still not working; anyway, that's what the demo was supposed to do.
We can go back to the slides. Today I covered a lot of different tools, so I just want to give a summary of everything we built. We built a custom text classification model using tf.keras. We deployed our model on Cloud AI Platform; I should mention that I could have also deployed that SHAP code on AI Platform using the custom code feature, but in this case I decided to keep it super simple and do that part separately. We explained the output of our model using SHAP to make sure it behaved the way we expected, and we called it from AI Platform with Cloud Functions.
You saw a video of what that was supposed to do. These are links to all the code I showed today and all the different products. The first link goes to the GitHub repository with the code for everything I showed; the other links are to the various products I covered. Finally, we're working on model interpretability support on Cloud AI Platform, so if you have a use case and want to stay informed and hear more about this, we'd love to hear from you: fill out the form at the bit.ly link at the bottom. And please fill out the feedback for this talk in your I/O app.
I really do use the feedback a lot to make the talk better next time. So thank you all.
