
1. Introduction to Statistics

Jun 04, 2021
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
The course you are currently sitting in is 18.650, and it is called Fundamentals of Statistics. Until last spring it was still called Statistics for Applications, and it turned out that, based on the content, Fundamentals of Statistics was a more appropriate title. So I will tell you a little bit about what we're going to cover in this class, what this class is about, and what it's not about. I realize there are several offerings in statistics on campus, so I want to make sure you've chosen the right one. I also understand that for some of you it is a matter of scheduling.
In fact, I need to make a disclaimer: I tend to speak too fast. I'm aware of that. Someone in the back, just do like this when you have no idea what I'm saying. Hopefully I will repeat myself many times, so if you average over time, statistics will tell you that you will get the right message, the one I was actually trying to send.
What are the goals of this class? The first one is basically to give you an introduction. No one here is expected to have seen statistics before, but, as you will see, you are expected to have seen probability. Usually you do see some statistics in a probability course, so I'm sure some of you have some ideas, but I won't expect anything. And we'll be using mathematics.

It's a math class, so there are going to be a lot of equations, and not so much real data and statistical thinking. We're going to try to provide theoretical guarantees. If I have two estimators available to me, how does the theory guide me in choosing the better one? How confident can I be in my guarantees or predictions? It is one thing to just spit out a number and another to put some error bars around it, so we'll see how to build error bars, for example.
You will have your own applications, and I'm happy to answer questions about specific applications. But instead of trying to tailor applications to an entire institute, I think we're going to work with pretty standard applications, mostly not very serious ones, and hopefully you'll be able to take the fundamental principles with you and apply them to your particular problem.
What I hope you take away from this class is this: when you have a real-life situation (and by real life I mean mainly at MIT, so some people probably wouldn't call that real life), the goal is to formulate the statistical problem in mathematical terms. If I want to ask whether a drug is effective, that's not in mathematical terms. I have to figure out what measurement I want to use to call it effective; maybe it's over a certain period of time. So there are a lot of things that you really need. I'm not really going to tell you how to go from the application to the point where you need to be, but I will certainly describe to you the point where you need to be if you want to start applying statistical methodology.
Then, once you understand what type of question you want to answer (do I want a yes/no answer, do I want a number, do I want error bars, do I want to make predictions five years from now, do I have side information or not), based on all those things, hopefully, you will have a catalog of statistical methods that you can use and that you know how to apply in the wild. You will also know that no statistical method is perfect.
Some of them people have agreed upon over the years, and people understand that this is the standard. But I want you to be able to understand what the limitations are and, when you draw conclusions based on data, that those conclusions could be wrong, for example.
More practically, my goal here is to have you ready. So who has taken, for example, a machine learning class here? All right, many of you actually; maybe a third have taken a machine learning class. Statistics has somewhat evolved towards machine learning in recent years, and my goal is to take you there. Machine learning has a strong algorithmic component, so maybe some of you have taken a machine learning class that mainly shows the algorithmic component. But there is also a statistical component: the machine learns from data. So this is the statistical track. There are some statistical machine learning classes that you can take here; I think they are offered at the graduate level. I want you to be ready to take those classes, having the statistical fundamentals to understand what you are doing, and then you will be able to expand to broader and more sophisticated methods.
The lectures are here from 11:00 to 12:30 on Tuesdays and Thursdays. Victor Emmanuel will also be holding mandatory recitations, so go on Stellar and pick your recitation, whether it's from 3 to 4 or from 4 to 5 on Wednesdays. It will focus mainly on problem solving. They are mandatory in the sense that, well, we're allowed to do this, but they won't cover completely new material. They might cover some techniques that could save you some time when it comes to the exam, so you could come out ahead.
Attendance won't be taken or anything like that, but I highly recommend that you go, because, well, they are mandatory, so you can't complain that something was taught only in recitation. So please sign up on Stellar for whichever of the two recitations you would like to be in. They are capped at 40, so it's first come, first served.
Homework will be weekly. There are a total of 11 problem sets. I realize this is a lot; I hope we keep them light. I just want you not to rush too much. The best ten will be kept, and this will count for a total of 30% of the final grade. They are due on Mondays at 8:00 p.m. on Stellar. And this is something new: we will not use the boxes outside of the math department, we will only use PDF files. You're always welcome to type them and practice your LaTeX or Word typing. I also understand that this can be a little stressful, so just write them down on a piece of paper, use your iPhone and take a picture.
Dropbox has a nice new thing for that. Try to find something that gives a lot of contrast, especially if you use a pencil, because we're going to check whether they are readable, and it's your responsibility to have a readable file. In particular, I've had students over the years (not at MIT, I must admit) who actually wrote a .doc file and thought that converting it to PDF consisted of deleting the extension and replacing it with .pdf. That's not how it works, so I'm sure you'll figure it out.
Try to keep them letter size. This isn't a strict requirement, but I don't want to see thumbnails either.
You are allowed two late assignments, and by late I mean 24 hours late. No questions asked: you submit them and they will be counted. You don't have to send an email notice or anything. Beyond that, given that you have slack for one zero grade and slack for two late assignments, you'll have to come up with a really good explanation for why you need more extensions than that, if you ever do. In particular, you'll need to keep track of why you used your three options before.
There are two midterms, one on October 3rd and the other on November 7th.
They will both be in class, for the duration of the lecture. When I say they last an hour and 20 minutes, it doesn't mean that if you arrive 10 minutes before the end of the lecture you will still get an hour and 20 minutes. It will end at the end of the lecture time. For this as well, no pressure: only the best of the two will be kept, and it will count for 30% of the grade. These will be closed book and closed notes. So the purpose is for you to... Yes? I said that the best of the two will be kept, yes. The best of the two, not the best two. We'll add them up, multiply the number by nine, and that will be your grade. No. I'm trying to be nice; there's only so much I can do.
The goal is for you to learn things and become familiar with them. In the final, you will be allowed to have your notes with you, but I want the midterms to also be a way for you to develop some mechanisms so that you don't waste too much time on things you should be able to do without thinking too much. You will be allowed a cheat sheet, because you can always forget something. It will be a double-sided letter-size sheet, you can practice writing as small as you want, and you can put whatever you want on this cheat sheet.
The final will be scheduled by the registrar. It will be three hours and it will count for 40%. You cannot bring books, but you can bring your notes. Yes? They are not? Oh yeah, there's one that's missing from both? Let's figure that out. The syllabus is the true one. The slides are here so we can discuss, but the dates in the syllabus are the ones that count, and I think they are also posted on the calendar on Stellar. Any other questions?
OK, so the prerequisites. Who has looked at the first problem set already? OK, so those whose hands went up realize that there is a real probability prerequisite for this class. It can be at the level of 18.600 or 6.041, I should say now; there are two classes.
I will need you to know some calculus and to have some notions of linear algebra, like what a matrix is, what a vector is, how you multiply those things together, and some notion of what orthonormal vectors are. We will talk about eigenvectors and eigenvalues, but I will remind you of all that. So this is not a strict prerequisite, but if you have taken it, for example, it doesn't hurt to go back to your notes as we approach the chapter on principal component analysis. The chapters as listed in the syllabus are in order, so you can see when it's actually coming.
There is no required textbook. I know you tend not to like that; you like having your textbook so you know where you're going and what we're doing. I'm sorry, it's just this class: I would have to go either to a mathematical statistics textbook, which is too much, or to a more engineering-type statistics class, which is too little. So hopefully the problems will be enough for you to practice, the recitations will have some problems to solve too, and the material will be posted on the slides, so you should have everything you need.
There are many resources online if you want to expand on a particular topic or read it as said by someone else. The book I recommend in the syllabus is called All of Statistics, by Wasserman, mainly because of the title: I guess it has all of it in there. It's pretty broad. It's more of an introductory graduate-level book; it's not very in-depth, but you get a lot of the overview. Certainly what we're going to cover is going to be a subset of what is in there.
The slides will be posted on Stellar before lectures, before we start a new chapter, and again after we are done with the chapter, with annotations and with typos fixed, as for the exam. There will also be some video lectures.
The first one will be the one posted on OCW from last year, but they will all be available on Stellar, modulo technical glitches of course. This is an automated system, and hopefully it will work out well for us. So if you somehow have to miss a lecture, you can always catch up by watching it. You can also play it at 0.75 speed in case I end up talking too fast, but I think I have managed fine so far. So that was just the last warning.
Why should you study statistics? Well, if you read the news, you will see a lot of statistics.
I mentioned that machine learning relies on a lot of statistics. If I had taught this class ten years ago, I would have had to explain to you that collecting data and making decisions based on data was something that made sense. But now it's almost part of our lives: we are used to this idea that data helps us make the right decisions, and that's why people use data to conduct studies. So here I found a bunch of press titles. I think the keyword I was looking for was "study finds". I didn't actually bother doing it again this year, so this is all 2016, but the keyword I generally look for is "study finds". So: a new study finds that traffic is bad for your health. We had to wait until 2016 for the data to tell us that. And there are many others that are a little more interesting.
For example, one that could be interesting to you is that a study finds that students benefit from waiting to declare a major. There's a ton of news headlines; there is one in MIT News about a study that finds brain connections that are key to reading. So here we have some idea of what happened: some data was collected, some scientific hypothesis was formulated, and then the data was there to try to prove or disprove this scientific hypothesis. That's the usual scientific process, and we need to understand how the scientific process works, because some of those things may be really questionable. Who is one hundred percent sure that the study finds that students benefit from waiting to declare a major? Maybe some of you?
I mean, I would be skeptical about this. I would be like, I don't want to wait to declare a major. So what kind of things can we bring to the table? Well, maybe this study looked at people who were different from me. Or maybe this study finds that this is beneficial for a majority of people; I'm not a majority, I'm just one person. There are a lot of things for which we need to understand what they really mean. And we'll see that they are not really statements about individuals. They are not even statements about the group of people the study actually looked at. They are statements about a parameter of a distribution that was used to model the benefit of waiting. So there are a lot of questions and a lot of layers that go into this, and we're going to want to understand what was going on in there, peel it back, and understand what assumptions have been put in there. Even though each one looks like a totally legitimate study, statistically, I think there's going to be one of those studies that is wrong. Well, maybe not one of these, but if I put up a long list of them, there would be some that would actually be wrong.
Well, if I put up 20, there will definitely be one that is wrong. So you have to see that every time you see 20 studies, one is probably wrong. When they are studies on the effects of drugs, out of a list of a hundred, one would be wrong. We'll see what that means and what I mean by that.
Of course, it's not only the studies that actually make discoveries that make the press titles; there's also the press that talks about things that make no sense. I love this first experiment, the salmon experiment. It was actually a graduate student who came to a neuroscience poster session, pulled out this poster, and explained the scientific experiment he was conducting, which consisted of taking a previously frozen salmon, putting it in an MRI, showing it violent images, and recording its brain activity. And he was able to find some voxels that were activated by those violent images. Can anyone tell me what happened here? Was the salmon responding to the violent images? Basically this is just a statistical fluke; it's just randomness at play. There are so many voxels being recorded and so many fluctuations (there's always a little bit of noise when you run those things) that some of them lit up just by chance. So we must understand how to correct for that. In this particular case, we need tools that tell us that finding three activated voxels, out of as many voxels as you can find in a salmon's brain, is too small a number; maybe we need to find a clump of 20 of them, for example. So we will have mathematical tools that help us find those particular numbers.
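The "one in 20 is probably wrong" remark and the salmon voxels are both instances of running many tests at once. The lecture does not state the error rate behind the remark; the sketch below assumes, purely for illustration, that each study (or voxel) independently produces a false positive with probability 5% when there is no real effect. The function names and the voxel count are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of false positives among n_tests independent null tests."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_tests independent null tests fires."""
    return 1.0 - (1.0 - alpha) ** n_tests


# 20 studies with no real effect, each tested at the assumed 5% level.
print(expected_false_positives(20))
print(round(prob_at_least_one_false_positive(20), 3))

# Hypothetical scan with 10,000 voxels and a stricter assumed level of 0.1%.
print(expected_false_positives(10_000, alpha=0.001))
```

Under these assumptions one false positive is expected on average among 20 studies, while the chance of at least one is about 64%, so "probably" is a fairer word than "definitely".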
I don't know if you saw this one from John Oliver about p-hacking. It's actually a whole segment, one of those long segments, about this. He was explaining that there's a sociology question here: there's a big incentive for scientists to publish results. You're not going to say, "You know what, this year I didn't find anything." So people are trying to find things, and just by searching, it's as if they were looking at all the voxels in a brain until they find one that lit up just by chance. They just run all these studies, and at some point one will come out right just by chance. So we have to be very careful when doing this. There are much more complicated problems associated with what is called p-hacking, which consists of violating the basic assumptions: in particular, looking at the data, then formulating your scientific hypothesis based on the data, and then going back to it. Your idea doesn't work? Let's just formulate another one. If you start doing all this, all bets are off. The theory that we're going to develop is actually for a very clean use of the data, which can be a little unpleasant. If you've had an army of graduate students collecting a year's worth of genomic data, for example, maybe you don't want to say, "Well, I had one hypothesis, it didn't work, let's throw the data in the trash." So we need to find ways to be able to do this. In fact, there is a course being taught on this.
It is at BU, and it's still in its early stages, but there's something called adaptive data analysis that will allow you to ask these types of questions.
Of course, statistics is not just for you to be able to read the press. Statistics will probably be used in whatever career you choose. It began in the 10th century in the Netherlands, in hydrology. The Netherlands is basically under water, below sea level, so they wanted to build some dikes. But once you're going to build a dike, you want to make sure that it's going to hold up to some floods. In particular, they wanted to build dikes that were high enough but not too high. You can always say, "Well, I'm going to build a 500-meter dike and then I'll be safe." You want something that's based on data. So what did they do? They collected data from previous floods, and then they found a dike height that was going to cover all of those. Now, if you look at the data they probably had, maybe it was sparse; maybe they had 10 data points. From those data points, maybe they wanted to interpolate between the points, or maybe extrapolate for the biggest one: based on what they've seen, maybe they have a chance of seeing something that's even bigger than anything they've seen before. And that's exactly the goal of statistical modeling: to be able to extrapolate beyond the data you have.
That is, you guess what you have not seen yet but might happen. When you buy insurance for your car, your apartment or your phone, there is a premium you have to pay, and this premium has been determined based on how much you are expected to cost the insurance company. It says, "OK, this person has a ten percent chance of breaking their iPhone, an iPhone costs this much to fix, so I'll charge them that amount, and then add an extra dollar for my time." That's basically how those things are determined, and this is using statistics. This is probably where statistics is mainly used. I was personally trained as an actuary, and that means being a statistician for an insurance company.
Clinical trials: this is also one of the first success stories of statistics, and it's now widespread. Every time the FDA approves a new drug for the market, it requires a very strict testing regimen with data, a control group and a treatment group, how many people you need in there, and what significance you need for those particular things.
Those things look like this. Now it's more like five thousand patients (it depends on what kind of drug it is), but say, for 100 patients, 56 were cured and 44 showed no improvement. Does the FDA consider this a good number? Do they have a table of how many patients must be cured? Is there a placebo effect? Do I need a control group of people who are actually receiving a placebo? All of these things are not clear. So there are a lot of things to put in place and a lot of floating parameters, and hopefully we can use statistical modeling to narrow it down to a small number of parameters so that we can ask very simple questions.
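One hedged way to put a number on the 100-patient example: assume (this model is not stated at this point in the lecture) that each patient is cured independently with the same unknown probability p, so that the number of cured patients is binomial. The sketch computes the observed cure fraction and the exact chance of seeing at least 56 cures if p were exactly 0.5. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


n_patients, n_cured = 100, 56
p_hat = n_cured / n_patients  # observed fraction of cured patients
tail = binomial_upper_tail(n_cured, n_patients, 0.5)

print(f"estimated p = {p_hat:.2f}")
print(f"P(at least {n_cured} cured | p = 0.5) = {tail:.3f}")
```

With these numbers the tail probability comes out at roughly 0.14, so under this simple model 56 cures out of 100 would not be very surprising even if p were exactly 0.5.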
"Is the drug effective?" is not a mathematical question, but "Is p greater than 0.5?" is a mathematical question. That's essentially what we're going to do: we're going to take "Is the drug effective?" and reduce it to "Is a variable greater than 0.5?"
Now, of course, genetics is using this too. It's typically the same size of data that you would see for fMRI data. This is actually a study that I found: you have about 4,000 Alzheimer's cases and 8,000 controls. People without Alzheimer's are what's called a control; it's there to make sure that you can see the difference with people who are not affected by a drug or by a disease. Is the APOE gene associated with Alzheimer's disease? Everyone can see why this would be an important question. We now have CRISPR, which targets very specific genes. If we could edit the gene, knock it down, knock it up or boost it, maybe we could have an impact. So those are very important questions, because we have the technology to target those things, but we need answers about what those things are. And there are a lot of other questions. The minute you go and tell the biologists, "I can do that," they say, "OK, are there other genes, or particular SNPs within the gene, that I can actually look at?" And they're looking at very different questions.
Now, when you start asking all these questions, you have to be careful, because you are reusing your data again and again, and this could lead you to erroneous conclusions. Those are everywhere these days, and that's why it goes all the way to John Oliver talking about them. Any questions about those examples?
So this is really a motivation. Again, we're not going to just take the data sets from those cases and look at them in detail. So what is common to all these examples? Why do we have to use statistics for all these things? Well, there's the randomness of the data. There's some effect that we just don't understand well: for example, the randomness associated with some voxels lighting up, or the fact that, as far as the insurance company is concerned, whether you're going to break your iPhone or not is essentially a coin toss. Sure, it's a biased one, but it's a coin toss.
So from the perspective of the statistician, all of those things are actually random events, and we need to tame this randomness, to understand this randomness. Is there going to be a lot of randomness, or only a little? Let's look, for example, at the floods. Were the floods that I saw consistently almost the same size, up to almost a rounding error, or were they very widely spread? We must understand all these things so we can understand how to build those dikes or how to make decisions based on that data. We need to understand this randomness.
OK, so the questions associated with randomness were actually hidden in the text. We talked about the notion of average. As far as the insurance company is concerned, they want to know, on average, what is the probability of your iPhone actually breaking, and that's what came up in this notion of fair premium. Then there's this notion of quantifying chance. Maybe we don't want to talk just about the average; maybe you want to cover, say, 99% of the floods. Then we need to know the height that is higher than 99% of the floods. Maybe there is 1% of them that is higher, but, you know, when doomsday comes, doomsday comes, and we are not going to pay for it. So that covers most of the floods. And then there are questions of significance. I gave you this example a second ago about clinical trials.
You have some numbers: clearly the drug cured more people than it didn't. But does that mean it's significantly good, or was it just by chance? Maybe these people just recovered on their own. It's like curing the common cold: you feel like, "Oh, I'm cured," but actually you waited five days and then you were cured. So there's this notion of significance, of variability. All of these are notions that describe randomness and quantify it into simple things. Randomness is a very complicated beast, but we can summarize it into things that we understand. Just as I am a complicated object: I am made of molecules and genes, of very complicated things, but I can be summarized by my name, my email address, my height and my weight, and maybe for most of you this is basically enough. You will recognize me without having to do a biopsy on me every time you see me.
So to understand randomness, you have to go through probability. Probability is the study of randomness; that's the first sentence a probability lecturer will say. That's why I need the prerequisite, because this is what we're going to use to describe randomness. We'll see in a second how it interacts with statistics. Sometimes, and probably most of the time throughout your semester of probability, randomness was very well understood. When you saw a probability problem, here was the chance of this happening, here was the chance of that happening. Maybe you had more complicated questions, but you had some basic elements to answer them. For example: the probability of having HBO is this much, the probability of watching Game of Thrones is that much, and given that I play basketball, what's the probability...?
You had these crazy questions, but you were able to build the answers, because all the basic numbers were given to you. Statistics will be about finding those basic numbers. Some examples that you've probably seen are dice, cards, roulette and flipping coins. All of these are things you've seen in a probability class, and the reason is that it is very easy to describe the probability of each outcome. For a die, we know that each face comes up with probability 1/6. Now, I'm not going to get into a debate here.
Whether this is pure randomness or determinism, I think that as a model for actual randomness a die is pretty good, and flipping a coin is a pretty good model too. So the questions you would see in probability are, for example, the following. I roll a die. Alice receives $1 if the number of dots is at most three. Bob receives $2 if the number of dots is at most two. Do you want to be Alice or Bob, given that your goal is to make money?
Yes, you want to be Bob, right? Let's see why. Look at the expectation of what Alice wins: this is $1 with probability 3/6, so that's 1/2. And the expectation of what Bob wins is $2 with probability 2/6, which is 1/3, so that's 2/3, which is definitely greater than 1/2. So Bob's expectation is actually a little bit higher. Those are the kinds of questions you can ask with probability. I described everything to you exactly: you used the fact that the die shows at most three dots with probability 1/2.
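The Alice and Bob comparison can be checked by enumerating the six faces of a fair die. This is a minimal sketch using exact fractions; the helper name `expected_payout` is hypothetical, and the winning conditions are taken as "at most three dots" and "at most two dots", which is what the probabilities 3/6 and 2/6 in the computation require.

```python
from fractions import Fraction

FACES = range(1, 7)  # the six equally likely faces of a fair die


def expected_payout(amount, wins):
    """Expected payout when `amount` dollars are paid on faces where wins(face) is True."""
    p_win = Fraction(sum(1 for face in FACES if wins(face)), len(FACES))
    return amount * p_win


alice = expected_payout(1, lambda dots: dots <= 3)  # $1 with probability 3/6
bob = expected_payout(2, lambda dots: dots <= 2)    # $2 with probability 2/6

print(alice, bob, bob > alice)  # prints: 1/2 2/3 True
```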
We knew that, and I didn't have to describe to you what's going on there. You didn't have to collect data on a die. Same thing here: you roll two dice, you choose a number between 2 and 12, and you win $100 if you chose the sum of the two dice. What number do you choose? Seven. Why? It's the most likely one. Your expected gain here will be $100 times the probability that the sum of the two dice, let's say X + Y, is equal to your little z, where z is the number you pick. Seven is the most likely to happen, and it is the one that maximizes this function of z. For this you need to study a more complicated function, one that involves two dice, but you can calculate the probability that X + Y equals z for each z between 2 and 12. So you know exactly what the probabilities are, and that's how you start with probability. That's exactly what I said.
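The function of z described above can be tabulated by listing all 36 equally likely outcomes of two fair dice. This is a small sketch with exact fractions; the variable names are illustrative and not from the lecture.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (x, y) outcomes give each sum z.
counts = Counter(x + y for x, y in product(range(1, 7), repeat=2))
prob = {z: Fraction(c, 36) for z, c in counts.items()}

best = max(prob, key=prob.get)    # the z that maximizes P(X + Y = z)
expected_gain = 100 * prob[best]  # $100 times the probability of winning

for z in sorted(prob):
    print(z, prob[z])
print("best choice:", best, "expected gain:", float(expected_gain))
```

The maximizer is z = 7, with probability 6/36 = 1/6, which gives an expected gain of 100/6 dollars.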
You can have a very simple process that describes basic events, with probability 1/6 for each of them, and then you can build on that and understand the probability of more complicated events. You can throw some money in there, you can build functions, you can do very complicated things based on that.
Now, a statistician would be the guy who just arrived on Earth, has never seen a die, and needs to understand that a die comes up with probability 1/6 on each side. The way he would do it is just to roll the die until he gets some counts, and try to estimate those probabilities. Maybe the guy comes and says, "Well, actually, the probability of getting a one is 1/6 plus 0.001, and the probability of getting a two is 1/6 minus 0.005," and there are going to be some fluctuations around this. It's going to be his role as a statistician to say, "Listen, this is too complicated a model for this; these should all be the same number." Just looking at the data, they should all be the same number. That's part of modeling: you make some simplifying assumptions that essentially make your questions more precise. Now, of course, if your model is wrong, if it's not true that all faces come up with equal probability, then you have a model error. So we'll make model errors, but that will be the price to pay to be able to extract anything from our data.
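The visitor's procedure can be sketched as a simulation: roll a fair die many times and estimate each face's probability by its observed frequency. The rolls here are simulated with Python's `random` module; the seed and the number of rolls are arbitrary choices made for illustration, and the function name is hypothetical.

```python
import random
from collections import Counter


def estimate_face_probabilities(n_rolls: int, seed: int = 0) -> dict:
    """Roll a simulated fair die n_rolls times and return the observed frequency of each face."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


estimates = estimate_face_probabilities(60_000)
max_dev = max(abs(p - 1 / 6) for p in estimates.values())

for face, p in estimates.items():
    # Each estimate fluctuates slightly around 1/6, as in the lecture's example.
    print(face, round(p, 4), "deviation from 1/6:", round(p - 1 / 6, 4))
print("largest deviation:", round(max_dev, 4))
```

The six frequencies differ slightly from one another, which is the "too complicated" model; the modeling step in the lecture is to decide that they are all estimates of one common number.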
For more complicated processes, of course, no one will waste time rolling dice. I mean, I'm sure you could have done this in AP Stat or something. But the need is to estimate parameters from data. For more complicated things, you might want to estimate some density parameter in a particular kind of material, and for this you might need to beam something at it and measure how fast it comes back. You will have some measurement errors, you may need to do that several times, and you need a model for the physical process that is actually happening. Physics is usually a very good way to get models from the engineering perspective. But there are also models for sociology, where we don't have a physical system, right?
I mean, God knows how people interact. Maybe I'm going to say that the way I make friends is to first flip a coin in my pocket: with probability two thirds I make my friend at work, and with probability one third I make my friend at football. Once I've decided to make my friend at football, I'll be facing someone who is flipping the same coin, with maybe slightly different parameters. Those things really exist: there are models for how friendships form, and the one I just described is called a mixed membership model. Those are models that are sort of hypothesized, and they are more reasonable than trying to take into account all the things that made you meet that person at that particular moment.
The goal here, once we have the model, is that it will come down to maybe two, three, four parameters, depending on how complex the model is, and then your goal will be to estimate those parameters. Sometimes the randomness we have here is real. There is some true randomness in some surveys: if I choose a student at random, as long as you believe that the random number generator that picks your ID is actually random, there is something random about you. The student I choose at random is a random student; the person I call on the phone is a random person. So there is some randomness that I can build into my system by drawing from a random number generator. A biased coin is a random thing. It's not a very interesting random thing, but it is a random thing, again if I overlook the fact that it's actually a deterministic mechanism; at a certain precision and a certain granularity, it can be considered a truly random experiment. Measurement error is another example: if you buy some measuring device or some optical device, you will have a standard deviation and things like that written on the side of the box, telling you that the device will make some measurement error. Usually that comes from noise in the device or things like this, and those are very accurately described by some random phenomenon.
But sometimes, and I would say most of the time, there is no randomness. It's not like you breaking your iPhone is a random event. Randomness is a big rug under which we sweep everything we don't understand, and we just hope that we've captured some sort of average effect of what's happening. The rest might fluctuate to the right, it might fluctuate to the left, but what's left is just a kind of randomness that can be averaged out. Of course, this is where the leap of faith is: we don't know whether we are correct to do this. Maybe we introduce big systematic biases by doing this; maybe we forget a very important component. For example, let's think of something: if I have a drug for breast cancer and I throw away the fact that my patient is a man or a woman, I'm going to have serious model biases. So if I say I'm going to collect some random patients and start doing this, there is certain information that I clearly need to build into my model. The model should be complicated enough, but not too complicated: it has to take into account the things that are systematically important. In particular, the simple rule of thumb is that when you have a complicated process, you can think of it as a simple process plus some random noise.
Now again, the random noise is everything you don't understand about the complicated process, and the simple process is everything you actually do understand. Good modeling, and this is not what we're going to look at in this class, is about choosing plausible simple models, and this requires an enormous amount of domain knowledge. That's why we won't do it in this class: this is not something where I can make a general statement on how to make a good model. If I were a statistician working on a study, I would have to grill the person in front of me, the expert, for two hours to find out: how about this, how about that, how does this work?
So you need to understand many things. There is a famous statistician to whom this quote is attributed, and it's probably not his, but Tukey said that he loved being a statistician because you get to play in everyone's backyard. You get to go see people and understand, at least to a certain extent, what their problems are, enough to build a reasonable model of what they're actually doing. So you get to do some sociology, some biology, some engineering, a lot of different things. At some point he was predicting the presidential election. So you can do a lot of different things, but it takes a lot of time to understand the problem you are working on, and if you have a particular application in mind, you are the best person to really understand it. So I'll just give you the basic tools.
Okay, so this is the circle of trust. No, this is just a simple graphic that tells you what is going on. When you do probability, you are given the truth: someone tells you what die God is rolling, so you know exactly what the parameters of the problem are, and what you are trying to do is describe what the outcomes will be. You can say, if you're rolling a fair die, 1/6 of the time in your data you're going to have a one, 1/6 of the time you're going to have a two, and so on. If I told you what the truth is, you could go to a computer and generate some data, or you could describe to me some macro properties of what the data would look like.
Oh, you'd see a lot of numbers centered around 35 if you drew them from a Gaussian distribution centered at 35; you'd know this kind of thing. You would know that if my Gaussian is centered at zero with, let's say, standard deviation 3, it's very unlikely that I'm going to see numbers below minus 10 or above plus 10; you basically won't see them. So from the truth, from the distribution of a random variable that has actual numbers in place of mu and sigma, you know what data you are going to get. Statistics is about going backwards: if I have some data, what was the truth that generated it? And since there are so many possible truths, modeling says that you have to choose one of the simplest possible truths, so that you can average. Statistics basically means averaging; you're averaging when you do statistics. For example, if I collect all your GPAs and my model is that the possible GPAs are any possible numbers and anyone can have any possible GPA, this will be a serious problem. But if I can summarize those GPAs into two numbers, say mean and standard deviation, then I have a pretty good description of what's going on, instead of having to predict an entire list. If I learn a whole list of GPAs and say, well, this was the distribution, it's not going to be of any use to me for predicting what the next GPA will be, or that of some random student walking in, or something like this.
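
The two directions described above can be sketched in a few lines. The first half is the probability side: the truth (mean 0, standard deviation 3) is given, and we compute how unlikely values beyond plus or minus 10 are. The second half is the statistics side: we only look at data and summarize it by two numbers. The sample size, the seed and the use of simulated draws centered at 35 are assumptions made for this sketch.

```python
import random
import statistics
from statistics import NormalDist

# Probability side: the truth is given (mean 0, standard deviation 3).
truth = NormalDist(mu=0, sigma=3)
p_outside = truth.cdf(-10) + (1 - truth.cdf(10))  # P(X < -10 or X > 10)
print(f"P(X < -10 or X > 10) = {p_outside:.5f}")

# Statistics side: only data are observed; summarize them with two numbers.
rng = random.Random(1)
data = [rng.gauss(35, 3) for _ in range(1_000)]  # simulated draws, for illustration only
mean_hat = statistics.fmean(data)    # estimate of the center
sd_hat = statistics.pstdev(data)     # estimate of the spread (divides by n)
print(f"estimated mean {mean_hat:.2f}, estimated standard deviation {sd_hat:.2f}")
```

Two numbers recovered from a thousand observations are enough to describe where the next draw is likely to fall, which a memorized list of values would not do.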
Just to finish my rant about probability versus statistics: this is a probabilistic question, and this is a statistical question. The probabilistic question is: previous studies showed that the drug was 80 percent effective. So you know this; the effectiveness of the drug is given to you, and this is how your problem starts. Then we can anticipate that for a study on 100 patients, on average 80 will be cured, and at least 65 will be cured with 99% chance.
So, okay, I go there, I go to Logan, and I call it: right, left, right. And then I see... what is the name of the fish place? Yeah, I go to Wahlburgers at Logan and I'm like, okay, I'm done for the day. I've collected this data, I go home and say: 66.7% to the right. That's a pretty big number; it's even further from 50 percent than this other guy's, so I'm doing even better. But of course you know that this is not true: three people are definitely not representative. If I had stopped at the first one or two, I might even have had 100%. So the question that statistics will help us answer is how large the sample should be. For some reason, I don't know if you get these too, I'm affiliated with the Broad Institute, and since then I get one email a day saying "sample size determination: how big should your sample be?"
I know how large my sample should be; I've taken 18.650 several times. But the question is: is 124 a large enough number or not? The answer is, as usual, it depends. It will depend on the true unknown value of p. But from the particular values that we got, 124 couples and 80 of them turning right, we can actually ask some questions. Here we said that 80, which gave us 64.5%, allowed us to conclude that the proportion was greater than 50%. 50% of 124 is 62, so the question is: would you be willing to reach this conclusion at 63? Is that a number that would convince you? Who would be convinced by 63? Who would be convinced by 72? Who would be convinced by 75? Hopefully the number of hands raised should increase. Who would be convinced by 80? All right. Those numbers don't come out of nowhere. 72 is the number that most statistical studies would retain: for 124 couples, you would need to see 72 turn their heads to the right to reach this conclusion.
Then 75: we'll see that there are many ways to come to this conclusion. As you can see, this was published in Nature with 80, so that was okay; 80 is actually a very large number. This one is 95 percent confidence, this one is 99 percent confidence, and this one is 99.9 percent confidence. So if you said 80, you are a very conservative person; starting at 72, you can begin to draw this conclusion. To understand this, we need to do our little mathematical kitchen here, and we need to do some modeling: we need to understand what random process we think generated this data.
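
The lecture does not say at this point how the numbers 72, 75 and 80 are obtained. One plausible reading, which is an assumption of this sketch rather than something stated in the transcript, is a one-sided normal approximation: if couples had no preference (p = 1/2), the count of right-turners among 124 would have mean 62 and standard deviation sqrt(124 × 0.25), and the threshold is the first integer above the corresponding normal cutoff. The function name is hypothetical.

```python
import math
from statistics import NormalDist

n = 124    # number of couples in the study
p0 = 0.5   # "no preference" value that we would like to rule out

def right_turn_threshold(confidence):
    """First integer count above the one-sided normal cutoff at the given confidence level."""
    z = NormalDist().inv_cdf(confidence)                  # standard normal quantile
    cutoff = n * p0 + z * math.sqrt(n * p0 * (1 - p0))    # mean + z * standard deviation
    return math.ceil(cutoff)

thresholds = {level: right_turn_threshold(level) for level in (0.95, 0.99, 0.999)}
print(thresholds)
```

Under that assumption the three levels give 72, 75 and 80 couples, matching the numbers quoted for 95%, 99% and 99.9% confidence.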
That process will have some unknown parameters, as opposed to probability, but basically we need to have everything written down except the values of the parameters. Just as, when I said a die comes up uniformly with probability 1/6, maybe I should instead have said: here are six numbers, and I just need to fill in those numbers. So for i from 1 to n, I am going to define R_i as an indicator. An indicator is simply something that takes value 1 if something is true and 0 if it is not; here it indicates that the i-th couple turns their head to the right. So R_i is indexed by i, and it is 1 if the i-th couple turns their head to the right and 0 if they turn to the left. Well, actually, I guess they could kiss straight on; that would be weird, but they could, so let's say 0 means not right. The estimator of p that we talked about was p hat, which was just the ratio of two numbers. But what am I really counting? I add up those R_i's, and since only the ones that take value 1 contribute, this sum is actually just counting the number of ones, which is another way of saying it counts the number of couples that kiss to the right. And here I don't even have to tell you anything about the numbers; I can just keep track: the first couple is a zero, the second couple is a one, the third couple is a zero. The data set, which you can find online, is actually a sequence of zeros and ones.
Now, clearly, for the question we are asking about this proportion, I don't need to keep track of all this information. All I need to keep track of is the number of zeros and the number of ones. They are completely interchangeable: there is no time effect here, and the first couple is no different from the fifteenth. So we call this R_n bar, and that will be a very standard notation that we will use; R could be replaced by other letters. The sum was equal to 80 in our example, and n was equal to 124. Now, this is an estimator, and an estimator is different from an estimate. An estimate is a number: my estimate was 64.5%. My estimator is this thing where I keep all the variables free, and in particular I keep those variables random, because I'm going to think of a random couple kissing left or right as the outcome of a random process, like flipping a coin and getting heads or tails. So R_i here is a random variable, and this average, being an average of random variables, is itself a random variable. So an estimator is a random variable, and an estimate is the realization of that random variable; in other words, it is the value you obtain for this random variable once you plug in the numbers you've collected. So I can talk about the accuracy of an estimator. What do we want from an estimator? Maybe we don't want it to fluctuate too much. It's a random variable, so I'm talking about the accuracy of a random variable, and maybe I don't want it to be too volatile, right?
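
The estimator versus estimate distinction can be made concrete with a small sketch: the estimator is the rule (a function of whatever sample comes in), and the estimate is the number that comes out once the observed zeros and ones are plugged in. The function name is hypothetical, and the order of the zeros and ones below is made up; only the counts, 80 ones out of 124, come from the lecture.

```python
def proportion_estimator(indicators):
    """The estimator R_n bar: a rule that maps any 0/1 sample to the average of its entries."""
    return sum(indicators) / len(indicators)

# The estimate: the value obtained once the collected data are plugged in.
# Since the observations are interchangeable, the (made-up) ordering does not matter.
observed = [1] * 80 + [0] * 44
estimate = proportion_estimator(observed)
print(f"estimate of p: {estimate:.1%}")
```

Feeding the same function a different sample of couples would give a different number, which is exactly why the estimator is treated as a random variable.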
You could have an estimator that simply discards 122 couples, keeps only two, and averages those two numbers. That is definitely a worse estimator than keeping all one hundred and twenty-four, so I need to find a way to say that. What I will be able to say is that the number will fluctuate: if I take another two couples, I'll probably get a completely different number, but if I take another 124 couples two days later, maybe I'll get a number very close to 64.5%. So that's one thing. The other thing we'd like from this estimator is that, beyond not being too volatile, it should also be close to the number we're looking for.
Here is an estimator, a beautiful random variable: seventy-two percent. That's an estimator. Go out and do your favorite study about drug performance, and then they'll call you, the MIT student who took statistics, and say: so how are you going to build your estimator from the data we've collected? And I say: I'm just going to spit out seventy-two percent, whatever the data says. That's an estimator. It's a stupid estimator, but it is an estimator. And this estimator is not volatile at all: every time you have a new study, even if you change fields, it will still be 72%. This is beautiful. The problem is that it is probably not very close to the value we are actually trying to estimate. So we need two things. Our estimator is a random variable, so think in terms of densities: we want the density to be pretty narrow, so this one is definitely better than that one. But we also want the number we're interested in, p, to be close to the values that this random variable is likely to take. If p is over here, this is not very good for us. That's basically what we'll be looking at.
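
The three estimators discussed above (the average of all 124 couples, the average of only two couples, and the constant 72%) can be compared by simulation. This is a rough sketch under stated assumptions: the "true" p of 0.6, the number of repeated studies and the seed are hypothetical choices for illustration, and the helper name `simulate` is not from the lecture.

```python
import random
import statistics

def simulate(estimator, p_true, n=124, n_studies=5_000, seed=0):
    """Repeat a study of n Bernoulli(p_true) couples; return (bias, variance) of the estimator."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_studies):
        sample = [1 if rng.random() < p_true else 0 for _ in range(n)]
        values.append(estimator(sample))
    # Bias: how far the estimator is from p_true on average. Variance: how much it fluctuates.
    return statistics.fmean(values) - p_true, statistics.pvariance(values)

estimators = {
    "average of all 124": lambda s: sum(s) / len(s),
    "average of first 2": lambda s: sum(s[:2]) / 2,
    "always answer 72%": lambda s: 0.72,
}

p_true = 0.6  # hypothetical truth, unknown in a real study
results = {name: simulate(est, p_true) for name, est in estimators.items()}
for name, (bias, variance) in results.items():
    print(f"{name:>20}: bias {bias:+.4f}, variance {variance:.5f}")
```

The constant estimator never fluctuates but sits away from the truth, the two-couple average is centered correctly but fluctuates a lot, and the full average does well on both counts.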
The first is known as variance. The second is known as bias. Those things appear everywhere in statistics. So we need to understand a model, and here is the model we have for this particular problem. We need to make assumptions about the observations that we see. We said we would assume each observation is a random variable; that's not a huge leap of faith, we're just sweeping under the rug everything we don't understand about those couples. So the first assumption is that each R_i is a random variable. Okay, you'll forget this one very soon. The second is that each of the R_i's is a random variable that takes the values 0 and 1.
Can anyone suggest a distribution for this random variable? Bernoulli, right. And it's really beautiful: this is where you have to do the least statistical modeling. A random variable that takes the values 0 and 1 is always a Bernoulli, which is the simplest random variable imaginable. Any variable that takes only two possible values can be reduced to a Bernoulli. So it's Bernoulli, and here we assume that it has parameter p. There is an assumption here; can anyone tell me what the assumption is? Yes, it's the same p for everyone. I could have said p_i, but it's p, and that's where I'll be able to start doing some statistics.
I can start pooling information across all my guys. If I assume they each have their own p_i, completely disconnected from each other, then I'm in trouble: there is nothing I can get. Then I will assume those guys are mutually independent, and most of the time we will just say independent. It means that it's not like all these guys called each other and it was actually a flash mob where they said, let's all turn our heads to the left; that would definitely not give you a valid conclusion. Again, randomness is a way to model lack of information.
Here, there would be a way to figure it out. Maybe I could have followed all those guys and known exactly who they were. Maybe I could have looked at pictures of them in the womb and guessed how they turn. By the way, that's one of the conclusions: they're guessing that we turn our heads to the right because our heads are mostly turned to the right in the womb. We don't know what goes on in the minds of kissers; there's physics, there's sociology, there are many things that could help us, but it is too complicated to keep track of, or too expensive in many cases. Now again, the nicest part of this modeling was the fact that the R_i's take only two values, which means that the conclusion that they are Bernoulli was totally free for us: once we know it's a random variable, it's a Bernoulli. Now, as we said, they could have been Bernoulli with a parameter p_i for each i.
I could have put in a different parameter for each couple, but I just don't have enough information. What would I say? Well, the first couple turned right, so p_1 has to be 1; that's my best guess. The second couple turned left, so p_2 should be 0; that's my best guess. So basically I need to be able to average my data, and the way to achieve that is by coupling all these guys: p_i should be the same p for every i. Does that make sense? What I'm assuming here is that my population is homogeneous. Maybe it's not; maybe I could look at a finer grain. But I'm basically making a statement about a population. Maybe you personally kiss to the left, and that's fine.
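
A tiny sketch of why a common p is needed: with one parameter per couple, the best guess for each p_i is just that couple's own observation, so nothing is learned; with a shared p, the observations can be averaged. The five observations below are made up for illustration.

```python
# Hypothetical first few observations: 1 = head turned right, 0 = not right.
observations = [1, 0, 0, 1, 1]

# One parameter per couple: the best guess for p_i is the single observation itself,
# so the "estimates" simply copy the data.
per_couple_estimates = [float(r) for r in observations]

# One shared parameter p: every observation informs the same unknown, so we can average.
pooled_estimate = sum(observations) / len(observations)

print("per-couple guesses:", per_couple_estimates)
print("pooled estimate of p:", pooled_estimate)
```

The pooled number is a statement about the population of couples, not about any one of them.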
I am not making a statement about an individual person; I am making a statement about the general population. Now, independence is probably reasonable. You can seriously hope that these couples didn't communicate with each other, that Kanye didn't send out a message saying we should all turn our heads to the left now, and that there's no external stimulus forcing people to do anything different.
Since we have less than ten minutes, let's do some exercises, if that's okay with you, so we can see what an exercise will look like; this is similar to the exercises you will see. Maybe we should do it together. So now we have a test, an exam in probability, and I have 15 students in this test. Hopefully these 15 grades are representative of the grades of an entire large class. If you took 18.600, it is a large class; there are definitely more than 15 students. Maybe by just taking a sample of 15 students at random, I want to get an idea of what my grade distribution will be while I'm grading them. Okay, so I'm going to make some modeling assumptions. Here are 15 students, and the grades are X_1 to X_15, just like we had R_1, R_2, up to R_124. Those were my R_i's; now I have my X_i's, and I'm going to assume that X_i follows a Gaussian, or normal, distribution with mean mu and variance sigma squared.
Now, this is modeling. Nobody told me this; there is no physical process that makes it happen. We know there is something called the central limit theorem in the background that says things tend to be Gaussian, but this is really a matter of convenience. Actually, if you think about it, this is terrible, because it puts nonzero probability on negative scores, and I'm definitely not going to get a negative score. But it's good enough, because although that probability is nonzero, it's probably something like 10 to the power of minus 12, so I would be very unlucky to see a negative score.
So here is the list of grades: 65 4179 T 58 82 76 78 (maybe I should have done it with 8), 59 69 sitting next to each other, 84 89 134 51 and 72. Those are the scores that I got; clearly there were some bonus points there. The question is to find an estimator for mu. What's my estimator for mu? Well, an estimator, again, is something that depends on the random variables. Mu is the expectation, so a good estimator is definitely the average score, just like we had the average of the R_i's. Now the X_i's no longer need to be zeros and ones, so it will not reduce to a number of ones divided by the total number. If I'm looking for an estimate, I actually need to add those numbers and divide by 15. So my estimate will be 1 over 15 times the sum of those numbers, 65 plus and so on up to 72. I can do that, and it's 67.5. So this is my estimate. Now suppose I want to compute a standard deviation, say an estimate for sigma.
You have seen that before. What is an estimate for sigma? We will see methods to do this, but sigma squared is the variance, the expectation of the square of X minus its expectation, E[(X - E[X])^2], and the problem is that I don't know what those expectations are. So I'll do what 99.9% of statistics is about. What is statistics about? It's about replacing expectations with averages. That's what all of statistics is about; there are 300 pages in a purple book called All of Statistics that tell you this. Then you do something fancy: maybe you minimize something after replacing the expectation, maybe you need to plug in other things. But really, every time you see an expectation, you replace it with an average.
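
The "replace expectations with averages" recipe can be sketched as follows for the mean and the standard deviation. The function name and the four scores are made up for illustration; they are not the grades read out in the lecture.

```python
import math
import statistics

def plug_in_estimates(xs):
    """Replace each expectation by an average: E[X] -> mean, E[(X - E[X])^2] -> mean squared deviation."""
    n = len(xs)
    mu_hat = sum(xs) / n                                   # average in place of E[X]
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # average in place of the variance
    return mu_hat, math.sqrt(sigma2_hat)

scores = [60, 70, 80, 90]  # hypothetical scores, for illustration only
mu_hat, sigma_hat = plug_in_estimates(scores)
print(f"mu hat = {mu_hat}, sigma hat = {sigma_hat:.3f}")

# The standard library's population standard deviation uses the same 1/n convention.
print(f"statistics.pstdev gives {statistics.pstdev(scores):.3f}")
```

Both steps are the same operation applied twice: an unknown expectation is swapped for the corresponding average over the sample.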
Let's do this. Sigma hat squared will be 1 over n times the sum, for i from 1 to n, of X_i minus, and here I need to replace my expectation with an average, which is actually this average that I'll call mu hat, all squared: sigma hat squared = (1/n) * sum over i of (X_i - mu hat)^2. There you have it: I have replaced my expectations with averages. So the golden rule is: take your expectation and replace it with this. Frame it, get a tattoo, I don't care, but that's what it is. If you remember one thing from this class, that's it. Now, you can be fancy: if you look at your calculator, it's going to put an n minus 1 here because it wants to be unbiased, and those are things we'll come to, but for now let's stick with this. Then when I plug in my numbers, I get an estimate for sigma, which is the square root of this estimator's estimate, and you can check that the number you get is 18. Those are basic things, and if you have taken any AP statistics, this should be completely standard for you.
Now I have another list, and I don't have time to look at it. It doesn't really matter; we'll do it next time. We'll look at another list of numbers and think about modeling assumptions. The point of this exercise is not to compute those things; it's really to think about modeling assumptions. Is it reasonable to think that things are i.i.d.?
Is it reasonable to think that they all have the same parameters, that they are independent, and so on? One thing I wanted to add: probably by tonight, in the spirit of using my iPad and fancy things, I'll try to post some videos, particularly for those who have never used a statistical table to read, say, the quantiles of a Gaussian distribution. There are several of you, so I'll do it. This is a simple but boring exercise, so I'll just post a video on how to do it, and you can find it on Stellar.
It'll take five minutes, and then you'll know everything there is to know about those tables. That's something you need for the first problem set, by the way. The problem set has 30 exercises in probability; you need to do 15, and you only need to hand in 15. You can hand in all 30 if you want. But you need to know what's in there by next week, by the time we get to those things. So if you don't have time to do all the homework and then go back to your probability class to figure out how to do it, just do 15 easy ones that you can do and hand those in, but go back to your probability class and make sure you know how to do them all.
These are pretty basic questions, and they are things I'm not going to slow down on. You have to remember that the expectation of the product of independent random variables is the product of the expectations, that the expectation of the sum is the sum of the expectations, these kinds of things that are a little silly but just require practice. So have fun; those are simple exercises, and you'll have fun remembering your probability class. I'll see you on Tuesday, or Monday.
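
The two probability facts recalled above can be checked numerically with a quick simulation. This is only a sanity check under assumed distributions (a uniform X and an independent Gaussian Y, both chosen arbitrarily for this sketch), not a proof.

```python
import random
import statistics

rng = random.Random(0)
n = 200_000
xs = [rng.uniform(0, 1) for _ in range(n)]  # E[X] = 1/2
ys = [rng.gauss(2, 1) for _ in range(n)]    # E[Y] = 2, drawn independently of X

mean_x = statistics.fmean(xs)
mean_y = statistics.fmean(ys)
mean_sum = statistics.fmean([x + y for x, y in zip(xs, ys)])
mean_prod = statistics.fmean([x * y for x, y in zip(xs, ys)])

# Linearity: the average of the sums equals the sum of the averages (no independence needed).
print(f"mean of X+Y: {mean_sum:.4f}   mean X + mean Y: {mean_x + mean_y:.4f}")
# Independence: the average of the products is close to the product of the averages.
print(f"mean of X*Y: {mean_prod:.4f}   mean X * mean Y: {mean_x * mean_y:.4f}")
```

The sum identity holds exactly for the sample averages, while the product identity only holds approximately and relies on X and Y being independent.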
