
11. Introduction to Machine Learning

May 10, 2020
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. ERIC GRIMSON: Okay. Welcome back. You know, it's that time of the quarter when we're all doing this. So let me see if I can get some smiles by just pointing out that the last class is in two weeks. It should deserve at least a small smile, right? Professor Guttag smiles. He likes that idea. You're almost there.
What did we do over the last two lectures? We talked about linear regression. And I just want to remind you that this was the idea that I have some experimental data. The case of a spring, where I put different weights on it and measured the displacements. And regression gave us a way to deduce a model that fit that data. In some cases it was easy. We knew, for example, that it was going to be a linear model. We found the best line that would fit that data. In some cases, we said we could use validation to let us explore and find the best model fit, whether linear, quadratic, cubic, or higher order.
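That regression-and-validation workflow can be sketched in a few lines of Python. The masses and displacements below are made-up stand-ins for the spring data (the lecture doesn't list the actual numbers), and the fit is ordinary least squares for a line:

```python
# Hypothetical spring data: masses (kg) and measured displacements (m).
masses = [0.05, 0.10, 0.15, 0.20, 0.25]
displacements = [0.10, 0.19, 0.31, 0.40, 0.49]

n = len(masses)
mean_x = sum(masses) / n
mean_y = sum(displacements) / n

# Ordinary least squares for a single feature: slope and intercept of the best line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(masses, displacements))
         / sum((x - mean_x) ** 2 for x in masses))
intercept = mean_y - slope * mean_x

# Use the fitted model to predict the displacement for an unseen mass.
predicted = slope * 0.30 + intercept
```

Validation, as described above, would then amount to fitting several polynomial degrees on one subset of the data and keeping whichever degree predicts a held-out subset best.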

So we'll use that to deduce something about a model. That's a nice segue into the topic of the next three lectures, the last big topic of the class, which is machine learning. And I'm going to argue that it's debatable whether that's actually an example of learning. But it has a lot of the elements that we want to talk about when we talk about machine learning. As always, there is a reading assignment. Chapter 22 of the book gives you a good start on this, and other pieces will follow. And I want to start by basically describing what we're going to do.
And I'm going to start by saying, as I'm sure you know, this is a very broad topic. I've listed just five subjects in Course 6 that focus on machine learning. And that doesn't include other subjects where learning is a central part. So natural language processing, computational biology, computer vision, and robotics today all rely heavily on machine learning. And you'll see it in those subjects as well. So we are not going to compress five subjects into three lectures. But what we are going to do is give you the introduction. We'll start by talking about the basics of machine learning.
The idea of having examples, and how you talk about the features those examples represent, how you measure the distances between them, and how you use the notion of distance to try to group similar things together as a way of doing machine learning. And, as a consequence, we will look at two different standard ways of learning. One class we call classification methods; the example we're going to look at is something called "k nearest neighbors." The second class is called clustering methods. Classification works well when I have what we would call labeled data: I know the labels on my examples, and I'm going to use them to try to define the classes that I learn. Clustering works well when I don't have labeled data.
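As a preview of the k-nearest-neighbors idea just named, here is a minimal sketch. The player heights and weights are invented for illustration; they are not the lecture's actual roster data:

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k closest training examples.
    `training` is a list of (feature_vector, label) pairs."""
    by_distance = sorted(training, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (height in inches, weight in pounds) examples.
players = [((70, 180), "receiver"), ((72, 190), "receiver"),
           ((74, 185), "receiver"), ((76, 300), "lineman"),
           ((77, 310), "lineman"), ((75, 305), "lineman")]

label = knn_classify(players, (73, 200), k=3)
```

The new player at (73, 200) sits near the three lighter examples, so the vote goes to "receiver"; with real data the interesting cases are the ones near the boundary.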
And we'll see what that means in a couple of minutes. But we'll give you an early look at this. Unless Professor Guttag changes his mind, we probably won't show you the really sophisticated machine learning methods today, like convolutional neural networks or deep learning, things you'll read about in the news. But you'll get an idea of what's behind those if you look at what we do when we talk about learning algorithms. Before I do, I want to point out how common this is. And I'll admit that, with my gray hair, I started working in AI in 1975, when machine learning was a pretty easy thing to do.
And it's been fascinating to watch the change over 40 years. And if you think about it, think about where you see it. AlphaGo, Google's machine learning-based system that outperformed a world-class Go player. Chess has long been dominated by computers. Go now belongs to computers. The best Go players in the world are computers. I'm sure many of you use Netflix. Any recommendation system, Netflix, Amazon, pick your favorite, uses a machine learning algorithm to suggest things to you. And in fact, you've probably seen it on Google, right? The ads that appear on Google come from a machine learning algorithm that analyzes your preferences.
Scary thought. Drug discovery, character recognition: the post office recognizes handwritten characters using a machine learning algorithm with a computer vision system behind it. You probably don't know this company; it's actually an MIT spinoff called Two Sigma, a hedge fund in New York. They make heavy use of artificial intelligence and machine learning techniques. And two years ago, their fund returned 56%. I wish I had invested in the fund. I don't have the millions you need, but it's an impressive return: 56% return on your money in one year. They didn't do so well last year, but they are doing extremely well using machine learning techniques.
Siri. Another big company out of MIT called Mobileye that makes computer vision systems with a heavy machine learning component, used in assisted driving and soon in fully autonomous driving. It will do things like slam on the brakes if you approach the car in front of you too fast, which will be really bad for me because I drive like a Bostonian, so it would be going off constantly. Facial recognition. Facebook uses this, like many other systems, to detect and recognize faces. IBM Watson: cancer diagnosis. These are all just examples of how machine learning is used everywhere.
And it really is. I've only picked nine. So what is it? I'm going to make a provocative statement. You're already used to that. I'm going to argue that you could claim that almost every computer program learns something. But the level of learning really varies a lot. So if you remember the first lecture of 6.0001, we showed you Newton's method for computing square roots. And you could argue, if you stretch it, that that method learns something about how to compute square roots. In fact, you could generalize it to roots of any order.
But it didn't really learn. I had to program it. All right. Think back to last week when we talked about linear regression. Now that's starting to look a little more like a learning algorithm. Because what did we do? We gave it a set of data points, mass-displacement data points. And then we showed how the computer could basically fit a curve to those data points. And in a sense, it was learning a model for that data that it could then use to predict behavior in other situations. And that's getting closer and closer to what we would like when we think about a machine learning algorithm.
We'd like to have a program that can learn from experience, something it can then use to deduce new facts. It's been a problem in AI for a long time. And I love this quote. It's from a gentleman named Art Samuel. The quote is from 1959, where he says that his definition of machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. And I think a lot of people would say that he wrote the first program of its kind. It learned from experience. In his case, it played checkers. It kind of shows you how the field has progressed.
We started with checkers, we got to chess, and now we do Go. But his program played checkers. It beat national-level players, and most importantly, it learned to improve its methods by watching how it did in games and then inferring something to change what it did. Samuel did many other things; I've just highlighted one. You might see in a follow-on course that he invented what's called alpha-beta pruning, which is a really useful technique for search. But the idea is, how can we make the computer learn without being explicitly programmed?
And one way to think about this is to think about the difference between how we would normally program and what we would like from a machine learning algorithm. Normal programming, and I know you're not convinced that normal programming exists, but if you think about traditional programming, what is the process? I write a program that I put into the computer so that it can then take data and produce some appropriate output. And the square root finder really fits that, right? I wrote code to use Newton's method to find a square root, and that gave me a process: give it any number, and it will give you the square root.
But if you think about what we did last time, it was a little different. And in fact, in a machine learning approach, the idea is that I'm going to give the output to the computer. I'm going to give it examples of what I want the program to do: data, labels, characterizations of different kinds of things. And what I want the computer to do is, given that characterization of outputs and data, for the machine learning algorithm to actually produce a program for me, a program that I can then use to infer new information about things. And that creates, if you will, a really nice loop, where I can have the machine learning algorithm learn the program that I can then use to solve some other problem.
It would be really great if we could do it. And as I suggested, that curve fitting algorithm is a simple version of that. It learned a model for the data, which it could then use to label any other instances of the data or predict what it would see in terms of spring displacement as it changed masses. So that's the kind of idea we're going to explore. If we want to learn things, we might as well ask ourselves, then how do we learn? And how should a computer learn? Well, for you as a human, there are a couple of possibilities.
This is the boring one. It's the old way of doing it, right? Memorize facts. Memorize as many facts as you can and hope the final exam asks about examples of those facts, as opposed to other facts you haven't memorized. This is, if you think back to the first lecture, an example of declarative knowledge: statements of truth. Memorize as many as you can. Keep Wikipedia in your back pocket. A better way to learn is to be able to infer, to deduce new information from what you already have. And if you think about it, that's closer to what we called imperative knowledge: ways of deducing new things.
Now, in the square root case, we built that in when we wrote the program. But what we would like in a learning algorithm is for it to be much more like that idea of generalization. We are interested in extending our capabilities to write programs that can infer useful information from implicit patterns in the data. So it's not something explicitly built in, like that relation between weights and displacements, but implicit patterns in the data, and we want the algorithm to figure out what those patterns are and use them to generate a program you can use to infer new data about objects, about spring displacements, whatever you're trying to do.
OK. So the idea, the basic paradigm that we'll look at, is to give the system some training data, some observations. We did it last time with just the spring displacements. Then we'll try to find a way to figure out how we write code, how we write a program, a system that will infer something about the process that generated the data. And then from that, we want to be able to use it to make predictions about things we haven't seen before. Once again, I want to make this point clear. If you think about it, the spring example fits that model.
I gave it a set of data, spring displacements relative to masses. For different masses, how far did the spring move? Then I deduced something about the underlying process. In the first case, I said I know it's linear, but let me figure out what the actual linear equation is. What is the spring constant associated with it? And based on that result, I got a snippet of code that I could use to predict new displacements. You have all of those things: training data, an inference engine, and then the ability to use them to make new predictions. But that's a very simple kind of learning setting.
So the most common one is the one I'm going to use as an example, that is, when I give you a set of examples, those examples have some data associated with them, some features, and some labels. For each example, you could say that it is a particular type of thing. This one is a different type of thing. And what I want to do is figure out how to make inferences by labeling new things. So it's not just about what the mass displacement is, but it's actually a label. And I'm going to use one of my favorite examples.
I'm a huge New England Patriots fan; if you're not, my apologies. But I'm going to use football players. So, as I'll show you in a second, I'll give you a set of examples of football players. The label is the position they play. And the data, well, it could be many things. We will use height and weight. But what we want to do is see how we would find a way to characterize the implicit pattern of how weight and height predict the kind of position this player could play. And then come up with an algorithm that predicts the position of new players.
We will make the draft for next year. Where do we want them to play? That is the paradigm. Set of observations, potentially labeled, potentially not. Think about how we make inferences to find a model. And then how do we use that model to make predictions? What we are going to see, and we will see multiple examples today, is that this learning can be done in two very broad ways. The first is called supervised learning. And in that case, for every new example that I give you as part of the training data, I put a label on it.
I know the kind of thing it is. And what I'm going to do is look for a rule that would predict the label of an unseen input based on those examples. It's supervised because I know what the labels are. The second type, if that one is supervised, the other obvious one is called unsupervised. In that case, I'll just give you a bunch of examples. But I don't know the labels associated with them. I'm going to try to find the natural ways to group those examples into different clusters. And in some cases, I can tell you how many clusters there are.
In some cases, I may want to simply say what is the best grouping I can find. OK. What I'm going to do today is not a lot of code. I expected applause for that, John, but I didn't get it. There is not much code. What I'm going to do is basically show you the intuitions behind this learning. And I'm going to start with my example of the New England Patriots. So here are some data points on the current Patriots players. And I have two types of positions. I have receivers and linemen. And each one is simply labeled by name, height in inches, and weight in pounds.
All right? If I plot them on a two-dimensional graph, this is what I get. Nothing fancy, right? What do I do? What I'm trying to learn is: are there features that distinguish the two classes from each other? And in the unlabeled case, all I have is just a set of examples. So what I want to do is decide what makes two players similar, with the goal of seeing, can I separate this distribution into two or more natural groups? Similar is a distance measure. It says how I take two examples with associated values or features, and decide how far apart they are.
And in the unlabeled case, the easiest way to do it is to say: if I know there are at least k groups there (in this case, I will tell you that there are two distinct groups), how could you decide how best to cluster things so that all the examples in one group are close to each other, all the examples in the other group are close to each other, and the two groups are reasonably far apart? There are many ways to do it. I'm going to show you one. It is a very standard way, and it basically works as follows.
If all I know is that there are two groups there, I'll start by simply picking two examples as my exemplars. Pick them at random. Actually, random isn't great; I don't want them to be too close to each other. I'm going to try to separate them. But I pick two examples as exemplars. And for every other example in the training data, I ask which exemplar it is closest to. What I'm going to try to do is create groups with the property that the distances between all the examples in each group are small, that the average distance is small, and see if I can find groups that make the average distance for both groups as small as possible.
This algorithm works by picking two exemplars, clustering all the other examples by simply saying: put each one in the group with the closest exemplar. Once I have those groups, I find the middle element of each group (not the mean, but the median: which example is closest to the center?), treat those as the new exemplars, and repeat the process. And I'll do that some number of times, or until I get no change in the process. So it's clustering based on distance. And we'll come back to distance in a second. So here's what my football players would give. If I were doing this based just on weight, there's the natural dividing line.
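The procedure just described (pick two exemplars, assign every point to the nearer one, move each exemplar to the most central member of its group, repeat until nothing changes) can be sketched roughly as follows. The points are made-up height/weight pairs, not the real roster:

```python
import math

def cluster_two(points, iters=10):
    """Two-group clustering: pick two exemplars, assign every point to the
    nearer one, replace each exemplar with the member of its group closest
    to the group's center, and repeat until the exemplars stop changing."""
    exemplars = [points[0], points[-1]]   # crude far-apart starting exemplars
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            nearest = min((0, 1), key=lambda j: math.dist(p, exemplars[j]))
            groups[nearest].append(p)
        new_exemplars = []
        for g in groups:
            # Most central member: the point closest to the group's mean.
            cx = sum(p[0] for p in g) / len(g)
            cy = sum(p[1] for p in g) / len(g)
            new_exemplars.append(min(g, key=lambda p: math.dist(p, (cx, cy))))
        if new_exemplars == exemplars:
            break
        exemplars = new_exemplars
    return groups

# Hypothetical (height, weight) points: three receiver-like, three lineman-like.
pts = [(70, 180), (72, 190), (74, 185), (75, 305), (76, 300), (77, 310)]
g1, g2 = cluster_two(pts)
```

With well-separated data like this, the loop settles after a couple of iterations; the interesting behavior, as the lecture notes, shows up when points sit near the boundary between groups.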
And it makes sense. Alright? These three are obviously grouped together and, again, are right on this axis. They're all down here. These seven are in a different place. There's a natural dividing line there. If I did it based on height, not so clean. This is what my algorithm found to be the best dividing line here, which means that these four, again, based solely on this axis, are very close together. These six are very close together. But it's not that clean. And that's part of the problem we'll look at: how do I find the best clusters. If I use both height and weight, I get that, which was pretty cool, right?
Those three are grouped together. They are close to each other, in terms of distance on the plane. Those seven are close to each other. There's a nice, natural dividing line here. And actually, that gives me a classifier. This line is the equidistant line between the centers of those two groups. That is, any point along this line is the same distance from the center of that group as from that group. And then any new example, if it's above the line, I would say it gets that label, if it's below the line, it gets that label. We'll come back to how we measure distances in a second, but the idea here is pretty simple.
I want to find clusters close to each other and far from the other cluster. Now let's assume that I really know the labels of these players. These are the receivers. Those are the linemen. And for those of you who are football fans, you can find out, right? Those are the two tight ends. They are much bigger. I think that's Bennett and that's Gronk if you're really a big Patriots fan. But those are tight ends, those are wide receivers, and that will come back in a second, but there are the labels. Now what I want to do is say, if you could take advantage of the knowledge of labels, how would you divide these groups?
And that's easy to see. The basic idea, in this case, is that if I have labeled groups in that feature space, what I want to do is find a subsurface that naturally divides that space. Now, subsurface is a fancy word. It says, in the two-dimensional case, I want to find the best line, if I can find a single line, that separates all the examples with one label from all the examples with the second label. We will see that if the examples are well separated, this is easy to do and is great. But in some cases it will be more complicated, because some of the examples may be very close to each other.
And that's going to raise a problem that you saw in the last lecture: I want to avoid overfitting. I don't want to create a really complicated surface to separate things. And so we may have to tolerate some mislabeled things if we can't separate them cleanly. And as you may have already figured out, in this case, with the labeled data, there's the best fitting line. Anyone over 280 pounds is going to be a lineman. Anyone weighing less than 280 pounds is more likely to be a receiver. OK. So I have two different ways of trying to think about how to do this labeling.
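In code, the classifier learned from the labeled data here is nothing more than a one-feature threshold. A sketch, using the 280-pound dividing line from the lecture (the function name is mine, not the lecture's):

```python
def label_player(height, weight, threshold=280):
    """One-feature linear separator learned from the labeled data:
    weight alone separates the two position groups."""
    return "lineman" if weight > threshold else "receiver"
```

Height is accepted but ignored, which mirrors the point above: once the labels are known, the best separating surface turned out to use only one of the two features.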
I'll get back to both in a second. Now suppose I add some new data. I want to tag new instances. Now these are actually players from a different position. These are runners. But I mean, all I know is about receivers and linemen. I get these two new data points. I'd like to know, are they more likely to be receivers or linemen? And there are the details of these two gentlemen. So if I plot them again now, you'll notice one of the problems. So there are my linemen, the red ones are my receivers, the two black dots are the two running backs.
And look right here. It's going to be very difficult to separate those two examples. They are so close to each other. And that will be one of the things we will have to deal with. But if I think about using what I learned as a classifier with unlabeled data, there were my two clusters. Now you see, oh, I have an interesting example. This new example, I would say, clearly looks more like a receiver than a lineman. But that one there is not clear. It is almost exactly along the dividing line between those two clusters. And I would say: I want to rethink the clustering. Or I want to say: you know what?
For all I know, there may not be two groups here. Maybe there are three. And I want to classify them a little differently. So I'll come back to that. On the other hand, if I had used the labeled data, there was my dividing line. This is really easy. Both new examples are clearly below the dividing line. They are clearly examples that I would classify more as receivers than linemen. And I know it's a football example. If you don't like football, pick another example. But you can see why I can use the data, in a labeled case and an unlabeled case, to come up with different ways to build the classifier.
So what we're going to do in the next two and a half lectures is look at how we can write code to learn that way of separating things. We are going to learn models based on unlabeled data. That's the case where I don't know what the labels are, I'm just trying to find ways to group things nearby and then use the groups to assign labels to new data. And we're going to learn models by looking at labeled data and seeing how best to separate with a line or a plane or a collection of lines, examples of one group, from examples of the other group.
Recognizing that we want to avoid overfitting, we don't want to create a really complicated system. And as a consequence, we're going to have to make some tradeoffs between what we call false positives and false negatives. But the resulting classifier can label any new data simply by deciding where it lies with respect to that line of separation. So this is what you will see in the next two and a half lectures. Every machine learning method has five essential components. We need to decide what the training data is and how we are going to evaluate the success of that system.
We have already seen some examples of that. We need to decide how we are going to represent each instance we give it. I happened to pick the height and weight of the football players. But maybe it would have been better to go with average speed or, I don't know, arm length, something else. How do I know which features are the right ones? And associated with that, how do I measure the distances between those feature vectors? How do I decide what is close and what is not? Maybe distance should be treated differently for weight versus height, for example.
I need to make that decision. And those are the two things we're going to show you examples of today, how to work through that. Starting next week, Professor Guttag will show you how to take these ideas and start building more detailed versions of clustering: measuring similarity to find an objective function that you want to minimize, to decide which is the best clustering to use, and then picking the best optimization method to learn that model. So let's start talking about features. I have a set of examples, labeled or not. I need to decide what there is in those examples that is useful to use when I want to decide what is close to something else or not.
And one of the problems is that features don't always capture what you want; if they did, this would be easy. I'm going to keep harping on that football analogy, but why did I pick height and weight? Because they were easy to find. You know, if you work for the New England Patriots, what do you really look for when you ask, what's the right characteristic? It's probably some other combination of things. So you, as a designer, have to say: what are the features that I want to use? That quote, by the way, is from one of the great statisticians of the 20th century, and I think it reflects this well.
So feature engineering, as a programmer, comes down to deciding which features I want to measure in the vector that I'm going to put together, and how I decide the relative weights to give them. So John, Ana, and I could have made our jobs this term really easy if we had sat down at the beginning of the term and said, you know, we've taught this course many times. We have data on, I don't know, John, thousands of students, probably, over that time. Let's just build a little learning algorithm that takes a set of data and predicts your final grade.
You wouldn't have to come to class, you wouldn't have to do the problem sets, because we would simply predict your final grade. Wouldn't that be nice? It would make our job a little easier, and you may or may not like the idea. But could you think about predicting that grade? Now, why do I give this example? I was trying to see if I could get some smiles. I saw a couple of them there. But think about the features. What do I measure? Actually, I'm going to pin this one on John because it was his idea. What would you measure?
Well, GPA is probably not a bad predictor of performance. If you do well in other classes, you will probably do well in this class. I'm going to state this one very carefully. Prior programming experience is at least a partial predictor, but it is not a perfect predictor. Those of you who haven't programmed before this class can still do very well in this class. But it's an indication that you've seen other programming languages. On the other hand, I don't believe in astrology. So I don't think the month you were born, the astrological sign you were born under, has anything to do with how well you would program.
I doubt eye color has anything to do with how well you would program. You get the idea. Some features matter, others don't. Now, you could include all the features and hope that the machine learning algorithm separates out the ones you want to keep from the ones you don't. But I remind you of that idea of overfitting. If I do that, there is a danger that I will find some correlation between birth month, eye color, and GPA. And that will lead to a conclusion that we really don't like. By the way, in case you're worried, I can assure you that Stu Schmill, the dean of admissions, does not use machine learning to choose you.
In fact, he looks at a lot of things, because it's not easy to replace him with a machine... yet. All right. So what this says is that we need to think about how we choose features. And mostly, what we're trying to do is maximize something called the signal-to-noise ratio: maximize the features that carry the most information, and remove the ones that don't. So I want to show you an example of how you might think about this. I want to label reptiles. I want to find a way to label animals as to whether they are reptiles or not. And I'll start with just one example.
With just one example, you can't really do much. But from this example, I know that a cobra lays eggs, has scales, is poisonous, is cold-blooded, has no legs, and is a reptile. So is my reptile model good? I'm not sure; I don't have enough data yet. But then I give you a second example, and it turns out it is also an egg-layer, has scales, is poisonous, cold-blooded, no legs. There's my model, right? A perfectly reasonable model, whether I designed it or a machine learning algorithm derived it, says that if all of these are true, label it a reptile.
OK? And now I give you a boa constrictor. Oh. It's a reptile. But it doesn't fit the model. In particular, it does not lay eggs and is not poisonous. So I have to refine the model. Or the algorithm has to refine the model. And this, I want to remind you, is about looking at the features. So I started with five features. This doesn't fit. So probably what I should do is reduce them. I'm going to look at scales. I'm going to look at cold-blooded. I'm going to look at legs. That captures all three examples.
Again, if you think about this in terms of clustering, all three would fit with that. OK. Now I give you another example: a chicken. I don't think it's a reptile. In fact, I'm pretty sure it's not a reptile. And it still does fine on this model, right? Because, while it has scales, which you may or may not realize, it is not cold-blooded and it has legs. This is, therefore, a negative example that reinforces the model. Sounds good. And now I'll give you an alligator. It's a reptile. And oh, rats, right? It doesn't satisfy the model. Because even though it has scales and is cold-blooded, it has legs.
I'm almost done with the example. But you see the point. Again, I have to think about how to refine this. And you might say, OK, let's make it a little more complicated: if it has scales, is cold-blooded, and has zero or four legs, I'm going to say it's a reptile. Then I give you the dart frog. It's not a reptile; it's an amphibian. And that's good, because the model correctly rejects it: it doesn't have scales and it's not cold-blooded, even though it has four legs. It is not a reptile. That's good. And then I give you...
I have to give you a python, right? I mean, there has to be a python here. Oh, come on. At least groan when I say that. There has to be a python here. So I'll give you that and a salmon. And now I'm in trouble. Because look at scales, look at cold-blooded, look at legs. I can't separate them. On those features, there is no way to draw a distinction that correctly says the python is a reptile and the salmon is not. And so there's no easy way to add that rule. And probably the best I can do is just go back to two features: scales and cold-blooded.
And basically say: if something has scales and is cold-blooded, I'll call it a reptile. If it doesn't have both, I'll say it's not a reptile. It won't be perfect. It's going to mislabel the salmon. But here I have made a design choice that is important. And the design choice is that I will have no false negatives. What that means is that there will be no instance of an actual reptile that I label as not a reptile; everything that is a reptile will be caught by the rule.
I may have some false positives, meaning I may have some things that I will incorrectly label as reptile. And in particular, salmon will be an example of this. This trade-off between false positives and false negatives is something that worries us when we think about it. Because in many cases there is no perfect way to separate the data. And if you remember my New England Patriots example, that running back and that receiver were so close together in height and weight, there was no way I could separate them. And I just have to be willing to decide how many false positives or false negatives I want to tolerate.
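One way to make the trade-off concrete is to count false positives and false negatives for the final scales-and-cold-blooded rule over the lecture's animals. The feature values below follow the lecture's descriptions; treat this as an illustrative tally, not a zoological reference:

```python
# Each animal maps to (has_scales, cold_blooded, is_reptile), per the lecture.
animals = {
    "cobra":     (True,  True,  True),
    "boa":       (True,  True,  True),
    "chicken":   (True,  False, False),
    "alligator": (True,  True,  True),
    "dart frog": (False, False, False),
    "python":    (True,  True,  True),
    "salmon":    (True,  True,  False),
}

def predict(has_scales, cold_blooded):
    # The final rule: scales AND cold-blooded => call it a reptile.
    return has_scales and cold_blooded

# False positive: rule says reptile, but it isn't one.
false_positives = [name for name, (s, c, r) in animals.items()
                   if predict(s, c) and not r]
# False negative: rule says not a reptile, but it actually is one.
false_negatives = [name for name, (s, c, r) in animals.items()
                   if not predict(s, c) and r]
```

The salmon shows up as the lone false positive, and the false-negative list is empty, which is exactly the design choice described above.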
Once I've figured out which features to use, I then have to decide about the distance. How do I compare two feature vectors? I say vector because each one could have multiple dimensions. How do I decide how to compare them? I want to use distances to figure out either how to group things together or how to find a dividing line that separates them. So one of the things I have to decide is which features. I also have to decide the distance. And finally, I may want to decide how to weight the relative importance of the different dimensions in the feature vector.
Some may be more valuable than others in making that decision. And I want to show you an example of that. So back to my animals. I started with a feature vector that actually had five dimensions: egg-laying, cold-blooded, has scales, I don't remember which one was the other one, and the number of legs. So one way to think about this is to say that I have four binary features and one integer feature associated with each animal. And one way to learn to separate reptiles from non-reptiles is to measure the distance between pairs of examples and use that distance to decide what is close to what, and what is not.
And as we said before, that will be used either to cluster things together or to find a classification surface that separates them. So here's a simple way to do it. For each of these examples, I'll let true be 1 and false be 0. So the first four components are either 0 or 1, and the last one is the number of legs. And now I can ask, okay, how do I measure distances between animals, or anything else represented by these kinds of feature vectors? Here we will use something called the Minkowski metric, or Minkowski distance. Given two vectors and a power p, we take the absolute value of the difference between each pair of components, raise it to the pth power, take the sum, and take the pth root of that.
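That definition translates directly into a few lines of Python. This is a sketch; `minkowski` is a hypothetical name, not necessarily what the handout calls it:

```python
def minkowski(v, w, p):
    """Minkowski distance between two equal-length feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(v, w)) ** (1 / p)

# p = 1 gives the Manhattan metric; p = 2 gives the Euclidean distance.
print(minkowski((0, 0), (3, 4), 1))  # 7.0
print(minkowski((0, 0), (3, 4), 2))  # 5.0
```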
So let's do the two obvious examples. If p equals 1, I simply measure the absolute distance between each pair of components and add them up, and that's my distance. That's called the Manhattan metric. The one you've seen the most, the one we saw last time, is p equals 2: the Euclidean distance, right? It is the square root of the sum of the squares of the differences of the components. You take the square root because it makes the measure have certain properties of a distance. So if I want to measure the difference between these two points, here's the question.
Is this circle closer to the star or closer to the cross? Unfortunately, I put the answer here. But it differs depending on the metric you use. Right? The Euclidean distance is the square root of 2 squared plus 2 squared, so it's about 2.8. And that one is 3. So in terms of standard distance in the plane, we would say that these two are closer than those two. The Manhattan distance, why is it called that? Because in Manhattan you can only walk along the avenues and the streets. The Manhattan distance would basically say this one is one, two, three, four units away. This one is one, two, three units away.
And under the Manhattan distance, this pair is closer than that pair. Now, you're used to thinking in Euclidean terms. We'll mostly use that. But this will matter when we think about how we compare the distances between these different pieces. Typically we will use Euclidean. We'll see that Manhattan actually has some value. So if I go back to my three examples... boy, that's a gross slide, right? But here we go: rattlesnake, boa constrictor, and dart frog. There is the representation. I can ask, what is the distance between them? In today's handout, we give you a little snippet of code that will do that.
And if I actually run it, I get a nice little result. Here are the distances between those vectors using the Euclidean metric. I'll come back to them. But you can see that the two snakes are reasonably close to each other, while the dart frog is quite far from both. Nice, right? That's a nice separation that says there's a difference between these two kinds of things. OK. Now I add in the alligator. It sounds like a Dungeons and Dragons game. I add the alligator and want to make the same comparison. And I don't get such a good result. Because now it says, as before, that the two snakes are close to each other.
But it says that the dart frog and the alligator are much closer, by this measure, than either of them is to the snakes. And to remind you, the alligator and the two snakes are the reptiles: I would like them to be close to each other and at some distance from the frog. Because I'm trying to classify reptiles versus non-reptiles. So what happened here? Well, this is a place where feature engineering is going to be important. Because, in fact, the alligator differs from the frog in three characteristics, and from, say, the boa constrictor in only two.
But one of those characteristics is the number of legs. And there, while on the binary axes the difference is at most between 0 and 1, on the legs axis it can be as much as between 0 and 4. So that is weighing the distance much more than we would like. The legs dimension is too big, if you like. How would I fix this? Actually, I would say this is a natural place to use the Manhattan distance. Why should I think that a difference in the number of legs is more important than whether it has scales or not? Why should I think it makes sense to measure that distance in the Euclidean sense?
They really are completely different kinds of measurements. And actually, I'm not going to do it, but if I applied the Manhattan metric to this, it would put the alligator much closer to the snakes, exactly because it differs in only two characteristics, not three. The other way around it would be to say that I'm allowing too much weight to be associated with the difference in the number of legs. So let's make it a binary feature: either it doesn't have legs or it has legs. Run the same measurement. And now you see that the snakes and the alligator are all close to each other.
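That experiment can be sketched as follows. The feature vectors here are reconstructed from the description (egg-laying, cold-blooded, has scales, a fourth binary trait assumed here to be poisonous, and the number of legs); the handout's exact vectors may differ:

```python
from math import sqrt

def euclidean(v, w):
    return sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

# Features: (egg-laying, cold-blooded, has scales, poisonous, # legs).
# These vectors are plausible reconstructions, not the handout's data.
animals = {
    "rattlesnake": (1, 1, 1, 1, 0),
    "boa":         (0, 1, 1, 0, 0),
    "dart frog":   (1, 0, 0, 1, 4),
    "alligator":   (1, 1, 1, 0, 4),
}

# With the raw leg count, the alligator looks closer to the frog...
print(round(euclidean(animals["alligator"], animals["dart frog"]), 3))  # 1.732
print(round(euclidean(animals["alligator"], animals["boa"]), 3))        # 4.123

# ...but with legs made binary (has legs / doesn't), it joins the snakes.
binary = {name: v[:4] + (1 if v[4] > 0 else 0,) for name, v in animals.items()}
print(round(euclidean(binary["alligator"], binary["dart frog"]), 3))    # 1.732
print(round(euclidean(binary["alligator"], binary["boa"]), 3))          # 1.414
```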
Whereas the dart frog is not as far away as it was before, but there's a pretty natural separation between them. What's my point? The choice of features matters. In fact, including too many features can lead to overfitting. And in particular, deciding the weights I want for those features has a real impact. And you, as a designer or programmer, have a lot of influence on how they're used. So feature engineering really matters. How you choose the features, and what you use for them, is going to be important. OK. For the last part of this lecture, we'll look at some examples where we give you data and features associated with it.
In some cases we will label them, and in others we will not. And now we know how to think about measuring the distances between them. John? JOHN GUTTAG: You probably didn't mean to say feature weight. You meant how they scale. ERIC GRIMSON: Sorry. The scale, and not the... thanks, John. You're right, I misspoke. I'll take that back. I didn't mean feature weights. I meant that the scale of the dimensions is going to be important here. Thanks for the amplification and the correction. You're absolutely right. JOHN GUTTAG: We'll use weights in a different way, as we'll see next time.
ERIC GRIMSON: And next time we'll see why we'll use weights in a different way. So let me rephrase it. Block that out of your mind. We're going to talk about scales and scaling on the axes as the important thing here. And we already said that we will see two different kinds of learning, labeled and unlabeled, clustering and classification. And I want to finish by showing you an example of each: how we would think about them algorithmically. We'll look at them in more detail next time. As we look at them, I want to remind you of the things that will be important to you.
How do I measure the distance between examples? What is the right way to design that? What is the right set of features to use in that vector? And then, what constraints do I want to put on the model? In the case of unlabeled data, how do I decide how many clusters I want to have? Because I can give you a really simple way to do clustering. If you give me 100 examples, I say build 100 clusters. Every example is its own cluster. The distance is really good: every example is very close to itself. But it does a terrible job of labeling things. So I have to think about how I decide how many clusters to use, and what the complexity of that separating surface is.
How do I basically avoid the overfitting problem, which I don't want to have? Just to remind you, we've already seen a small version of this, the clustering method. This is a standard way of doing it, just repeating what we had on an earlier slide. If I want to cluster the data into groups, I start by saying how many clusters am I looking for. Pick an example I take as my initial representation of each cluster. For every other example in the training data, put it in the closest cluster. Once I have them, find the median of each cluster and repeat the process. And that led to that separation.
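The loop just described can be sketched in a few lines. This is a rough cut assuming Euclidean distance and using the mean of each cluster as its new center (the common k-means choice; the slide's "median" variant would differ slightly), with a naive choice of initial exemplars:

```python
from math import dist          # Euclidean distance (Python 3.8+)
from statistics import mean

def cluster(points, k, iterations=10):
    """Greedy k-clustering: assign to nearest center, recompute, repeat."""
    centers = list(points[:k])          # naive choice of initial exemplars
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:                # put each point in its closest cluster
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            groups[nearest].append(p)
        for j in range(k):              # move each center to its group's mean
            if groups[j]:
                centers[j] = tuple(mean(xs) for xs in zip(*groups[j]))
    return groups

# Two well-separated blobs in the plane end up in two clean clusters.
two_blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(cluster(two_blobs, 2))
```

Real implementations pick the initial exemplars more carefully and stop when the assignment no longer changes, but the structure of the loop is the same.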
Now, once I have it, I'd like to validate it. And in fact, I should have said that better. Those two clusters came up without looking at the two black dots. Once I place the black dots, I'd like to validate: how well does this really work? And that example is really not very encouraging. It's too close to call. So that's a natural place to say: OK, what would happen if I did this with three clusters? And this is what I get. I like that. All right? There is a very nice cluster up there. The fact that the algorithm didn't know the labeling is irrelevant.
There is a nice cluster of five. There is a nice cluster of four. And there's a nice cluster of three in the middle. And in fact, if I look at the average distance between examples in each of these clusters, it's much tighter than in that earlier example. And that leads to the question, should I look for four clusters? Question, please. AUDIENCE: Isn't that overlap between the two clusters a problem? ERIC GRIMSON: Yes. The question is, is the overlap between the two clusters a problem? No. I just drew it that way so that you can see where those pieces are.
But in reality, if you like, the center is there. Those three points are all closer to that center than to that other center. So the fact that they overlap is a good question, but it's just how I drew them. Really, I should draw them not as circles but as slightly more complicated surfaces. OK? Having done three, you might say, should I go to four? Well, those points down there, as I already said, are an example that will be hard to separate. And I don't want to overfit. Because the only way to separate them would be to create a really complicated cluster, which I don't like.
All right? Let me finish by showing you an example from the other direction. That is, suppose I give you labeled examples. Again, each example has features associated with it, which may have multiple dimensions. But I also know the label associated with each one. And I want to learn the best way to come up with a rule that lets me take new examples and assign them to the right group. There are several ways to do this. You can simply say, I'm looking for the simplest surface that separates those examples. In my football case, the examples were in the plane, so what is the best line that separates them? That's easy.
Or you could look for a more complicated surface. And we'll look at an example in a second where maybe it's a sequence of line segments that separates them, because no single line makes the separation. As before, I want to be careful. If I make it too complicated, I might get a really good separator, but I overfit to the data. You'll see that next time; I'm just going to highlight it here. There is a third approach, which leads to almost the same kind of result, called k nearest neighbors. And the idea here is that I have a set of labeled data.
And what I'm going to do is, for each new example, find the k, say the five, closest labeled examples. And vote. If 3 out of 5, or 4 out of 5, or 5 out of 5 of those labels are the same, I'll say the new example is part of that group. And if I get less than that, I'll leave it unclassified. And that's a nice way to think about how to learn. And let me finish by showing you an example. I won't use football players in this case; I'll use a different example. I'm going to give you some voting data.
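The voting scheme just described can be sketched like this; the data and the function name are made up for illustration:

```python
from collections import Counter
from math import dist

def knn_label(examples, query, k=5, min_votes=3):
    """examples: list of (feature_vector, label) pairs. Returns the winning
    label if it gets at least min_votes of the k nearest, else None."""
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    label, votes = Counter(lab for _, lab in nearest).most_common(1)[0]
    return label if votes >= min_votes else None   # None = unclassified

# Toy labeled data: two clumps in the plane.
data = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
        ((9, 9), "B"), ((10, 9), "B"), ((9, 10), "B")]
print(knn_label(data, (0.5, 0.5)))   # A: three of the five nearest are A
```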
I think this is actually simulated data. But it's a set of voters in the United States with their preferences: they tend to vote Republican, or they tend to vote Democratic. And the two features are their age and how far they live from Boston. Whether those are relevant or not, I don't know, but they're just the two things I'm going to use to classify them. And I'd like to ask, how would you fit a curve to separate those two classes? I'm going to hold back half of the data for testing. I'm going to use half of the data for training.
So if this is my training data, can I find the best line separating it? I don't know which one is best, but here are two examples. This solid line has the property that all the Democrats are on one side. Everything on the other side is Republican, but there are some Republicans on this side of the line. I can't find a line that completely separates them, like I did with the football players. But there is a decent line that separates them. Here is another candidate. That dashed line has the property that on the right side you have... boy, I don't think this was deliberate, John... on the right side, you have almost all the Republicans.
There's one Democrat over there, but there's a pretty good split. And on the left side you have a mix of things, but most of the Democrats are on the left side of that line. All right? The fact that left and right correlate with distance to Boston is completely irrelevant here. But it's a pretty good split. JOHN GUTTAG: Relevant, but not accidental. ERIC GRIMSON: Relevant, but not accidental. Thank you. All right. So now the question is, how would you evaluate these? How do I decide which one is better? And I'm just going to show you, very quickly, some examples.
The first is to look at what is called the confusion matrix. What does that mean? It says, for one of these classifiers, for example the solid line, here are the predictions, based on which side of the line each example falls, of whether it would be Democrat or Republican, and here are the actual labels. The same for the dashed line. And the diagonal is important, because those are the correctly labeled results. Right? In the case of the solid line, it correctly labels all of the Democrats. It gets half of the Republicans right. But it has some that are actually Republican that it labels as Democrat.
We would like that diagonal to be really big. And in fact, it leads to a natural measure called accuracy. Going back to that, we say these are the true positives: I labeled it as an instance, and it really is. These are the true negatives: I labeled it as not being an instance, and it really isn't. Then these are the false positives: I labeled it as an instance and it's not. And these are the false negatives: I labeled it as not being an instance, and it is. And an easy way to measure performance is to look at the correct labels over all of the labels.
That is, the true positives plus the true negatives, the ones I got right, over everything. And in that case, both models give a value of 0.7. So which one is better? Well, I should validate that, and I'll do that in a second by looking at other data. We could also ask, could we find a model with less training error? This is only 70% right. Not great. Well, here is a more complicated model. And this is where you start worrying about overfitting. What I have done now is create a sequence of line segments that separate the examples. Everything above this line, I'm going to say, is Republican.
Everything below this line, I'm going to say, is Democratic. So I'm avoiding that one. I'm avoiding that one. I'm still capturing a lot of the same things. And in this case, I get 12 true positives, 13 true negatives, and only 5 false positives. And that's nice. You can see the 5: it's those five red ones down there. Its accuracy is 0.833. But now if I apply that model to the test data, its accuracy drops to about 0.6. I could use this idea to try to generalize and ask whether I can come up with a better model.
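The accuracy computation, as a quick sketch. The counts are the ones quoted for the complex model on the training data; the zero false negatives is inferred from the description:

```python
def accuracy(true_pos, false_pos, true_neg, false_neg):
    """Fraction of all labels that were correct."""
    correct = true_pos + true_neg
    total = true_pos + false_pos + true_neg + false_neg
    return correct / total

# Complex model on the training data: 12 TP, 5 FP, 13 TN, 0 FN (inferred).
print(round(accuracy(12, 5, 13, 0), 3))  # 0.833
```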
And you'll see that next time. There are other ways I could measure this, and I want to use this as a last example. Another nice measure we use is called PPV, positive predictive value: out of all the things I labeled positive, how many are true positives? For the solid-line and dashed-line models, I get values around 0.57. The complex model is better on the training data, and on the test data as well. And finally, two other examples are called sensitivity and specificity. Sensitivity basically tells you what percentage of the positives I found correctly.
And specificity says what percentage of the negatives I correctly rejected. I show you this because this is where the trade-off comes in. Sensitivity is how many of the things I'm looking for I correctly labeled; specificity is how many of the things I'm not looking for I correctly rejected. I can make the sensitivity 1: just label everything as the thing I'm looking for. Great, I find all of them. But the specificity will be 0, because I'll have a bunch of things labeled incorrectly. Or I could make the specificity 1: reject everything, label nothing as an instance. The true negatives go to 1, and I'm in great shape there, but my sensitivity goes to 0.
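These measures, sketched with hypothetical function names, along with the two extremes just described:

```python
def ppv(tp, fp):
    return tp / (tp + fp)     # of everything labeled positive, how much really is

def sensitivity(tp, fn):
    return tp / (tp + fn)     # fraction of actual positives found

def specificity(tn, fp):
    return tn / (tn + fp)     # fraction of actual negatives rejected

# Extreme 1: label everything positive. On data with 10 positives and
# 20 negatives, that means tp=10, fn=0, fp=20, tn=0.
print(sensitivity(10, 0), specificity(0, 20))  # 1.0 0.0

# Extreme 2: reject everything: tp=0, fn=10, fp=0, tn=20.
print(sensitivity(0, 10), specificity(20, 0))  # 0.0 1.0
```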
So I have a trade-off. As I think about the machine learning algorithm I'm using and my choice of classifier, I'll see a trade-off where I can increase specificity at the cost of sensitivity, or vice versa. And you'll see a nice technique called the ROC, or receiver operating characteristic, curve that gives you a sense of how to deal with that. And with that, we'll see you next time. We'll take your question offline, if you don't mind, because my time is up. But we'll see you next time, when Professor Guttag will show you examples of this.
