Predict The Stock Market With Machine Learning And Python

May 30, 2024

Hello, my name is Vic, and today we are going to predict the stock market using machine learning. We'll start by downloading data for the S&P 500 index, then clean the data and use it to train a model. We'll backtest to determine how good our model is and add a few more predictors to improve our accuracy. We'll finish with some next steps that you can use to continue improving the model on your own. It will be a really fun and exciting project. Before I started Dataquest, I spent a lot of time predicting the stock market, winning machine learning competitions, and developing and selling algorithms.

There are a lot of real-world considerations when predicting the stock market that not all tutorials will show you, so I'll show them to you today so you can build a higher-quality project. Let's dig in. By the end of this project, we will have created a machine learning model that can predict tomorrow's S&P 500 index price given historical data. We will also have backtested this model on over 20 years of historical data so that we can be really confident in the predictions it is making.

Okay, let's go ahead and get started. We will use JupyterLab for this project; you can also use Jupyter Notebook if you have it installed. The first thing we will do is import a package called yfinance.

This package calls the Yahoo Finance API to download daily stock and index prices. The first thing we will do is initialize something called a Ticker class, which will allow us to download the price history of a single symbol. In this case we'll use the symbol ^GSPC, which is the S&P 500 index. Next, we're going to query the historical prices, so we'll use the history method and pass period equal to "max", which queries all the data from the very beginning, when the index was created.

We end up with a pandas DataFrame, which is very nice. In this DataFrame, each row is the price on a single trading day, so non-trading days are not included. The columns are the opening price (the price when the market opened), the highest price during the day, the lowest price during the day, the closing price (the price when the market closed), and the volume (the total volume traded that day). We are going to use these columns to predict whether the price will go up or down tomorrow. We also have additional columns for dividends and stock splits, but we will not use them, and in fact we will delete them later.

If we take a look at the index of the sp500 DataFrame, we can see that we have a DatetimeIndex. The index is the column on the left, if you're not familiar with it, and it will let us index and slice the DataFrame easily later. The first thing we're going to do is plot the data in the DataFrame, with the closing price plotted against the index.
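As a rough sketch of the download step described above: the function name download_sp500 is my own, and the yfinance import sits inside the function so that defining it needs neither the package nor a network connection. Ticker and history(period="max") are the yfinance calls the walkthrough refers to.

```python
def download_sp500(symbol="^GSPC", period="max"):
    """Return the daily price history for `symbol` as a pandas DataFrame.

    Hypothetical helper for illustration; it needs the yfinance package
    and an internet connection when it is actually called.
    """
    import yfinance as yf  # lazy import: only required at call time

    ticker = yf.Ticker(symbol)            # one Ticker object per symbol
    return ticker.history(period=period)  # "max" asks for the full history


# Typical notebook usage (left commented out because it hits the network):
# sp500 = download_sp500()
# sp500.index                                  # DatetimeIndex of trading days
# sp500.plot.line(y="Close", use_index=True)   # closing price against the date
```

Keeping the call commented out means the cell can be pasted anywhere without triggering a download.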

This puts the index, which is the trading days, on the x-axis and the closing price on the y-axis. We can run it and get a nice graph of the price history of the S&P 500, and we may really regret not having bought an index fund at some point over the past few years.

All right, we're going to do a little data cleaning here. We're just going to delete the extra columns that we don't need: the Dividends column and the Stock Splits column. These columns are more appropriate for individual stocks, not for an index, so we don't really need them. The next thing we're going to do is set up our target, which is what we're actually going to predict.
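The cleaning step can be sketched on a tiny stand-in frame. The prices below are made up for illustration, and the column names are assumed to match what the history call returns.

```python
import pandas as pd

# Tiny stand-in for the downloaded frame; all values are invented.
sp500 = pd.DataFrame(
    {
        "Open": [100.0, 101.0],
        "High": [102.0, 103.0],
        "Low": [99.0, 100.0],
        "Close": [101.0, 102.0],
        "Volume": [1000, 1100],
        "Dividends": [0.0, 0.0],
        "Stock Splits": [0.0, 0.0],
    },
    index=pd.to_datetime(["1990-01-02", "1990-01-03"]),
)

# Remove the two columns that only matter for individual stocks.
del sp500["Dividends"]
del sp500["Stock Splits"]
```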

Using machine learning, our target will be whether the price will go up or down tomorrow. Some people like to predict the absolute price, so trying to predict whether the stock price will be 17 or 18 tomorrow. The big problem with that is that your model can be extremely accurate at predicting the absolute price and you can still lose a lot of money. Ultimately, if you're buying and selling stocks, you don't care about being accurate on the absolute price; you care about being accurate on the direction, that is, whether the price will go up or down, so that if you buy the stock, it will then go up.

You can be very close when predicting the actual price and still be way off when predicting whether the stock will go up or down. So what we're going to try to do is ask: on days when the stock goes up, can we actually predict that it will go up? That way, if we want to buy the stock, we know we can buy it and the price will go up. So our target will be whether the stock goes up or down. First we will create a column called Tomorrow, which will hold tomorrow's price.

We'll use the pandas shift method to help us do this, so let me run this and then show you what actually happened. We take the Close column and shift all the prices back one day, so you can see that for January 3, 1950, the Tomorrow column is now the closing price from January 4. Now that we have a column showing tomorrow's price, we can set up a target based on it. The target is what we are going to try to predict with machine learning, and all it has to check is whether tomorrow's price is greater than today's price. That comparison returns a Boolean value, but we want to convert it to an integer so we can use it in machine learning, so we use the astype method and pass in int. Now if we show the sp500 DataFrame, we can see that we have a Target column which is 1 when the price went up, that is, when tomorrow's price is greater than today's price, and 0 when the price went down. This is what we are going to try to predict. Next, note that there is a lot of historical data in this DataFrame.
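Here is a minimal sketch of the Tomorrow and Target columns on four made-up closing prices (the dates and values are illustrative, not real index data). One detail worth noticing: the last row has no tomorrow, so its comparison against NaN is False and its target comes out as 0.

```python
import pandas as pd

sp500 = pd.DataFrame(
    {"Close": [10.0, 10.5, 10.2, 10.4]},  # invented prices
    index=pd.to_datetime(["1950-01-03", "1950-01-04", "1950-01-05", "1950-01-06"]),
)

# shift(-1) moves every price back one row, so each row sees the next day's close.
sp500["Tomorrow"] = sp500["Close"].shift(-1)

# True where tomorrow's close is higher than today's, converted to 1/0.
sp500["Target"] = (sp500["Tomorrow"] > sp500["Close"]).astype(int)
```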

Usually a lot of historical data is great, but with stock market data, if you go back too far, the market could have fundamentally changed, and some of that old data may not be so useful for making future predictions. So what we're going to do is remove all the data from before 1990. We'll use the pandas loc indexer and say that we only take the rows where the index is at least January 1, 1990. We can take a look at what happened, and you can see that there are now only dates after January 1, 1990.
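The date filter can be sketched like this, again on invented rows. Slicing a sorted DatetimeIndex with a date string keeps everything from that date onward, and the trailing copy() makes the result an independent frame.

```python
import pandas as pd

dates = pd.to_datetime(["1989-12-28", "1989-12-29", "1990-01-02", "1990-01-03"])
sp500 = pd.DataFrame({"Close": [350.0, 353.0, 359.0, 358.0]}, index=dates)  # invented prices

# Keep only rows dated January 1, 1990 or later.
sp500 = sp500.loc["1990-01-01":].copy()
```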

Now, I wrote .copy() here because if you don't, you can sometimes get a pandas SettingWithCopyWarning when you subset a DataFrame and then later assign back to it, so .copy() helps us avoid that.

Okay, now we have our data set up and we can start training our first machine learning model. For our initial model I'm going to use something called a random forest classifier. I love using random forests as my default model for most machine learning for a few reasons. A random forest works by training a bunch of individual decision trees with randomized parameters and then averaging the results of those decision trees. Because of this process, random forests are resistant to overfitting; they can overfit, but it is harder for them to overfit than it is for other models. They also run relatively quickly, and they can pick up nonlinear tendencies in the data. For example, the opening price is not linearly correlated with the target: if the opening price is 4000 versus 3000, there is no linear relationship between the opening price and the target. A higher opening price doesn't mean the target will also be higher. Random forests can pick up these nonlinear relationships, and in stock price prediction most relationships are nonlinear.

If you can find a linear relationship, then you can make a lot of money. So let's initialize our model and pass in some parameters. n_estimators is the number of individual decision trees we want to train. Generally, the higher it is, the better your accuracy will be, but only up to a limit; you can't just get free accuracy by setting this higher and higher. I'm going to set it pretty low to make this run quickly for us, but you might want to try a higher value. min_samples_split helps protect us against overfitting. Decision trees have a tendency to overfit if they build the tree too deeply. If you don't know much about decision trees, don't worry, but setting min_samples_split helps protect against that overfitting. The higher we set it, the less accurate the model will be, but the less it will overfit. You might want to experiment with this to find the optimal number. Then I'll set random_state equal to 1.

A random forest, as you might have guessed, has some randomization built in. Setting a random state means that the random numbers that are generated will be in a predictable sequence each time, using this random seed of 1. So if we run the same model twice, we will get the same results, which helps if you're updating or improving your model and want to make sure it was actually the model, or something you did, that reduced the error rather than just something random. Okay, now let's split our data into a train set and a test set.
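A small sketch of the model setup and of what random_state buys you. The values n_estimators=100 and min_samples_split=100 are assumptions on my part (the walkthrough only says the number of trees is set "pretty low"); random_state=1 is the value mentioned above. The training data here is random noise, used only to show that two fits give identical results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_model():
    # n_estimators and min_samples_split are assumed starting values.
    return RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))             # synthetic features
y = (rng.random(500) > 0.5).astype(int)   # synthetic 0/1 labels

# Two separately built and fitted models with the same seed agree exactly.
first = make_model().fit(X, y).predict_proba(X)
second = make_model().fit(X, y).predict_proba(X)
```

Because the seed is fixed, any change in the score after you modify the features or parameters comes from your change rather than from a different random draw.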

This is time series data, and with time series data you can't use cross-validation. Well, you can, but if you do, your results will look amazing when you are training and horrible in the real world. The reason is that if you use cross-validation, or any other way of splitting your training and test sets that doesn't take the time series nature of the data into account, you will be using future data to predict the past. That is something you simply can't do in the real world, and it results in something called leakage, where you're leaking information into the model. If I asked you to predict the stock price tomorrow and told you what the stock price will be in 30 days, you'd probably do better at predicting tomorrow's price than you would if I told you nothing about the future.

We want the model to actually learn how to predict the stock price, not to rely on knowledge about the future that we are not going to have in the real world. The way we're going to split this data set is to put all the rows except the last 100 into the training set and the last 100 rows into the test set. I'll show you a more sophisticated way to split the data and measure error later, but for now we're just creating a simple baseline model, and this is the easiest way to split.

Then we'll define the predictors, so I'll create a list with all the columns that we are going to use to predict the target. I like to be very explicit about the predictors because I have gotten burned before by using all the columns as predictors and ending up with a model that looks amazing in training, with 100% accuracy, but doesn't work in the real world.

It's really easy to accidentally use the Tomorrow column or the Target column itself to predict the target, and then your model effectively knows the future, which is not going to happen in the real world. So we're going to use Close, Open, High, Low, and Volume as our predictors. Then we're going to go ahead and fit the model with model.fit, passing the training set's predictor columns and the training set's Target column. In other words, we train the model to predict the target from the predictor columns. Let's run that; it will take a little time to run.
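The split and fit can be sketched as follows. The price series is a synthetic random walk standing in for the real download, and the model parameters other than random_state=1 are assumed values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the S&P 500 frame (random walk, not real prices).
rng = np.random.default_rng(0)
n = 600
close = 350 + np.cumsum(rng.normal(0, 2, n))
sp500 = pd.DataFrame(
    {
        "Open": close + rng.normal(0, 1, n),
        "High": close + 2,
        "Low": close - 2,
        "Close": close,
        "Volume": rng.integers(1_000_000, 2_000_000, n),
    },
    index=pd.bdate_range("1990-01-02", periods=n),
)
sp500["Tomorrow"] = sp500["Close"].shift(-1)
sp500["Target"] = (sp500["Tomorrow"] > sp500["Close"]).astype(int)

# Time-ordered split: everything but the last 100 rows trains, the last 100 test.
train = sp500.iloc[:-100]
test = sp500.iloc[-100:]

# Explicit predictor list, so Tomorrow and Target can never slip in.
predictors = ["Close", "Open", "High", "Low", "Volume"]

model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=1)
model.fit(train[predictors], train["Target"])
```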

Our next step is to measure how accurate the model is. This is a really important part of machine learning, because it tells you whether your model is doing what you think it is or not. So we're going to import again from scikit-learn, this time something called precision_score. What the precision score tells us is: when we said the market would go up, that is, when we predicted the target would be 1, did it really go up? In other words, what percentage of the times we said the market would go up did it actually go up? This is a very good error metric for this particular case because I'm assuming that we want to buy stock, hold it, and then sell it, so we want to make sure that when we buy, the price will actually go up. Depending on what your goals are, you may want to adjust the error metric you use to measure performance, but in this case we will use precision_score. We will generate predictions using our model with the predict method, passing in our test set with the predictors.

These predictions are in a NumPy array, which is a little difficult to work with, so we're going to convert it into a pandas Series that uses the same index as our test set. I have to import pandas for this, so let's import pandas and then create the Series. We can see that the predictions are now a Series and a little easier to read. Then we're going to go ahead and calculate the precision score using the actual target and the predicted target. We can see that this is not a very good precision score: when we said the stock price would go up, it only went up 42 percent of the time. That's not great. You would actually be better off trading the opposite of what this model tells you to do. But that's okay; we're going to improve this model so that we get more accurate predictions.
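To make the precision score concrete, here is a hand-checkable sketch with five invented days. The list raw_preds stands in for what model.predict would return.

```python
import pandas as pd
from sklearn.metrics import precision_score

index = pd.bdate_range("2022-01-03", periods=5)
actual = pd.Series([1, 0, 1, 1, 0], index=index, name="Target")

raw_preds = [1, 1, 0, 1, 0]  # stand-in for model.predict(test[predictors])
preds = pd.Series(raw_preds, index=index, name="Predictions")

# Precision = (days we said "up" and it went up) / (all days we said "up").
score = precision_score(actual, preds)
```

We said "up" on three days and the market rose on two of them, so the score is 2/3.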

Next, we're going to quickly plot our predictions. To do that, we're going to combine our actual values with our predicted values using the pandas concat function. We're concatenating our test target, which holds the actual values, with our predicted values, and passing axis=1, which means treating each of these inputs as a column in our data set. Now we can plot this. The orange line, labeled 0, is our predictions, and the blue line is what actually happened. We can see that we mostly predicted the market would go up, and mostly it seems to have gone down, which explains why our predictions were so far off.
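A short sketch of the combine step on invented values. The predictions Series is left unnamed here on purpose, which is why its column gets a numeric label in the plot legend.

```python
import pandas as pd

index = pd.bdate_range("2022-01-03", periods=4)
actual = pd.Series([1, 0, 1, 0], index=index, name="Target")
preds = pd.Series([1, 1, 1, 0], index=index)  # unnamed Series

# axis=1 places each Series side by side as its own column.
combined = pd.concat([actual, preds], axis=1)

# combined.plot()  # draws one line per column; requires matplotlib
```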

The next thing we are going to do is create a more robust way to test our algorithm. Currently we can only test against the last 100 days, but if you are really building a stock price model and want to use it in the real world, you want to be able to test across multiple years of data, because you want to know how your algorithm will handle many different situations. That gives you more confidence that it will work in the future. So we're going to do something called backtesting. To enable backtesting, the first thing we're going to do is create a predict function, which basically wraps up everything we just did into one function. It fits the model using the training predictors and the target, generates our predictions with model.predict on the test predictors, and then converts the predictions into a Series, which I'll actually just copy and paste.

The only difference here is that I gave the Series of predictions a name. Finally, it combines everything the same way we did before, and at the end we return our combined DataFrame with the actual values and the predictions.

Now we can write a backtest function that takes our S&P 500 data, a machine learning model, and our predictors. It also takes a start value, which we will set to 2500, and a step value. What is the start value? When you backtest, you want to have a certain amount of data to train your first model. Each trading year has about 250 days, so a start of 2500 means taking 10 years of data and training your first model on those 10 years. The step is 250, which means we will train a model for about a year and then move on to the next year, and then the year after that. So we're going to take the first 10 years of data and predict values for year 11, then take the first 11 years of data and predict values for year 12, then take the first 12 years of data and predict values for year 13, and so on. This way we will get predictions for many different years and be able to have more confidence in our model.

Okay, so in this backtest function we're going to create a list called all_predictions, which will be a list of DataFrames where each DataFrame holds the predictions for a single year. Then we're going to write a loop that goes through our data year by year and makes predictions for all years except the first 10 or so. Inside the loop we'll split our training and test data.

I'm going to use .copy() to avoid that SettingWithCopyWarning. This code does exactly what I described: the training set is all the years before the current year, and the test set is the current year. Then we use our predict function, passing train, test, predictors, and model, to generate our predictions, and we append the predictions for the given year to all_predictions. At the end we concatenate all our predictions together; concat can take a list of DataFrames and combine them into a single DataFrame.

Let's go ahead and run these. Now we can backtest our S&P 500 data with the model and the predictors that we created above. After the backtest finishes, we can begin to evaluate the error of our predictions. First, let's take a look at the predictions and see how many days we predicted the market would go up versus down using value_counts, which simply counts how many times each type of prediction was made.
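The predict and backtest functions described above might look roughly like this. The defaults start=2500 and step=250 come from the walkthrough; the demo at the bottom uses a synthetic random walk with a much smaller start and step so that it runs quickly, and its model parameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict(train, test, predictors, model):
    """Fit on the training rows, then return actual and predicted targets for the test rows."""
    model.fit(train[predictors], train["Target"])
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index, name="Predictions")
    return pd.concat([test["Target"], preds], axis=1)


def backtest(data, model, predictors, start=2500, step=250):
    """Walk forward through time: train on rows before i, predict the next `step` rows."""
    all_predictions = []
    for i in range(start, data.shape[0], step):
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i + step)].copy()
        all_predictions.append(predict(train, test, predictors, model))
    return pd.concat(all_predictions)


# Small synthetic demo (random walk, not real prices).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame(
    {"Close": 100 + np.cumsum(rng.normal(0, 1, n)), "Volume": rng.integers(1000, 2000, n)},
    index=pd.bdate_range("2000-01-03", periods=n),
)
data["Target"] = (data["Close"].shift(-1) > data["Close"]).astype(int)

model = RandomForestClassifier(n_estimators=20, min_samples_split=20, random_state=1)
predictions = backtest(data, model, ["Close", "Volume"], start=200, step=50)
```

Each fold only ever trains on rows that come before the rows it predicts, which is what keeps future data out of the model.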

Looking at those counts, we can see that we predicted the market would go down on about 3,000 days and up on about 2,000 days. Now we can look at our precision score by passing in the target and the predictions. Across all of these rows, about 6,000 trading days, we had a precision of about 53 percent, so when we said the market would go up, it went up 53 percent of the time. Is that good or not? As a benchmark, we can look at the percentage of days that the market actually went up. To do this, we take the value counts of the target and divide by the total number of rows, which gives us percentages. The S&P 500, on the days we were looking at, actually went up on 53.6 percent of days and down on 46.3 percent of days. So if all we had done was wake up every day and say, "I'm going to buy and then sell at the end of the day," we would actually have been better off than we were using this algorithm.
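The evaluation steps can be sketched on eight invented rows, small enough to check by hand. The frame predictions stands in for the output of the backtest.

```python
import pandas as pd
from sklearn.metrics import precision_score

# Invented backtest output: actual direction and predicted direction.
predictions = pd.DataFrame(
    {
        "Target": [1, 1, 0, 1, 0, 1, 0, 1],
        "Predictions": [1, 0, 0, 1, 1, 0, 0, 1],
    }
)

# How many times each prediction (0 or 1) was made.
counts = predictions["Predictions"].value_counts()

# Of the days predicted "up", the fraction that really went up.
precision = precision_score(predictions["Target"], predictions["Predictions"])

# Benchmark: share of all days on which the market went up or down.
baseline = predictions["Target"].value_counts() / predictions.shape[0]
```

Here the model said "up" four times and was right three times (0.75), while the market rose on five of eight days (0.625), so in this toy case the model beats the benchmark.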

This algorithm performed a little worse than the natural percentage of days on which the stock market went up, but that's okay. Now that we have backtesting, we can have a lot more confidence in our model and in our ability to test it. So the next thing we're going to do is add a few more predictors to our model and see if that improves our accuracy.

We're going to create a variety of rolling averages. If you were a human analyst trying to predict whether a stock will go up tomorrow, some of the numbers you might look at are: is the stock price higher today than it was last week, three months ago, a year ago, five years ago? You could use all of those inputs to help you determine whether the stock is going to go up or down, and we'll give the algorithm that same information. The horizons are the windows over which we want to compute rolling means. We will calculate the mean closing price over the last two days, the last trading week, which is five days, the last three months or so, which is 60 trading days, the last year, and the last four years. Then we will find the ratio between today's closing price and the mean closing price over those periods. This helps us know whether the market has gone up a ton, because if so, it may be due for a downturn.

Or has the market gone down a ton? If so, it may be due for an upswing. So we're just going to give the algorithm more information to help it make better predictions. We're going to create a list called new_predictors, which will hold the names of the new columns that we create. Then we're going to loop through these horizons, calculate a rolling average for each horizon, and create a couple of columns. One will be the ratio column, and we will name it Close_Ratio_ followed by the horizon, so Close_Ratio_2, Close_Ratio_5, and so on. We'll add it to our sp500 DataFrame. All this is going to be is the closing price of the S&P 500 divided by the rolling average. So the first time through the loop, this will be the ratio between today's close and the average close over the last two days.

The second time through the loop, it will be the ratio between today's close and the average close over the last five days, and so on. We can also look at a trend. The trend will simply be the number of days in the last x days, whatever the horizon is, on which the stock price actually went up. To build the trend column, we'll use shift again, but this time we'll shift forward, and then we'll take the rolling sum of the target. So what is this doing?

Let me scroll up to the sp500 DataFrame. On any given day, this looks at the previous few days and takes the sum of the target. If we are on January 8, 1990, and the horizon is five days, it looks at the five days before that and sums the target. There are only four earlier days available here, so it can't actually calculate a rolling sum, but let's say there were a fifth day: then it would take the sum, which is the number of days on which the stock price actually went up. Then we'll add the ratio column and the trend column to new_predictors.

Let's go ahead and run that. Now we should have some extra columns in our sp500 DataFrame, and you can see there are a lot of NaNs. When pandas can't find enough days, that is, enough rows before the current row, to calculate a rolling average, it simply returns NaN. For example, Close_Ratio_2 is based on the rolling average of the previous day and the current day. For January 2, 1990, there are no days before it, so pandas can't calculate a rolling average and returns NaN. For January 3, 1990, it can, because it takes the average of that day and the previous day. It's the same with all of these columns. It's a little different for the trend, because the trend can't include the current day. There you are looking at the two previous days and not the current day, because if you included it, you would be including today's target in that column, which would give you leakage and make your algorithm look awesome without working in the real world. So we're going to get rid of the rows with missing values using dropna.
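The rolling features can be sketched as below. The horizons list [2, 5, 60, 250, 1000] is inferred from the description (two days, a week, about three months, a year, about four years), the column names Close_Ratio_N and Trend_N follow the names mentioned in the walkthrough, and the prices are a synthetic random walk.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the S&P 500 frame (random walk, not real prices).
rng = np.random.default_rng(0)
n = 1500
sp500 = pd.DataFrame(
    {"Close": 350 + np.cumsum(rng.normal(0, 1, n))},
    index=pd.bdate_range("1990-01-02", periods=n),
)
sp500["Tomorrow"] = sp500["Close"].shift(-1)
sp500["Target"] = (sp500["Tomorrow"] > sp500["Close"]).astype(int)

horizons = [2, 5, 60, 250, 1000]  # assumed values, in trading days
new_predictors = []

for horizon in horizons:
    # Rolling mean of the close over the window ending today (today included).
    rolling_close = sp500["Close"].rolling(horizon).mean()
    ratio_column = f"Close_Ratio_{horizon}"
    sp500[ratio_column] = sp500["Close"] / rolling_close

    # shift(1) pushes the target forward one row so the sum covers only
    # previous days and never today's own target.
    trend_column = f"Trend_{horizon}"
    sp500[trend_column] = sp500["Target"].shift(1).rolling(horizon).sum()

    new_predictors += [ratio_column, trend_column]

# Rows without enough history (and the final row with no Tomorrow) are dropped.
sp500 = sp500.dropna()
```

With a 1000-day horizon the first usable row is the 1001st, which mirrors the way the real data set ends up starting in 1993 rather than 1990.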

Now we see that our data starts in 1993. That's because of the Trend_1000 and Close_Ratio_1000 columns, which need about four years of data before they can be calculated. Let's see how these new predictors perform. We'll update our model slightly and change some of our parameters: we'll increase the number of estimators to 200, reduce min_samples_split to 50, and keep our random state. Let's go ahead and run that. Then we're going to slightly rewrite our predict function, so let me go up and copy and paste it. When you run .predict, the model returns 0 or 1.

What we really want is a little more control over what becomes a 1 and what becomes a 0, so we'll use the predict_proba method instead. This returns a probability that the row is a 0 or a 1: the first column is the probability that the stock price will go down tomorrow, and the second column is the probability that it will go up. Then we want to set our custom threshold. By default the threshold is 0.5, so if there is a greater than 50 percent chance that the price will go up, the model predicts that the price will go up. We're going to set that threshold to 0.6 instead, which means the model has to be more confident that the price will go up before it predicts that it will.

What this does is reduce our total number of trading days: it reduces the number of days on which we predict the price will go up, but it increases the chance that the price will actually go up on those days. That fits very well with what we want. We don't want to make a ton of trades; we want to know that when we do make a trade, the price will actually go up. We don't want to trade every day, which is a way to lose money pretty quickly. The rest of this function should be the same, so let's run that, and then let's go ahead and run our backtest again, this time passing in our new predictors.
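A sketch of the thresholded predict function follows. The threshold parameter is my own addition so the two cutoffs can be compared side by side; the model parameters (200 trees, min_samples_split=50, random_state=1) are the ones given above, and the data is random noise with hypothetical feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict(train, test, predictors, model, threshold=0.6):
    """Predict 1 only when the model's probability of an up day reaches `threshold`."""
    model.fit(train[predictors], train["Target"])
    # Column 1 is the probability of class 1, assuming both classes appear in training.
    proba_up = model.predict_proba(test[predictors])[:, 1]
    preds = (proba_up >= threshold).astype(int)
    preds = pd.Series(preds, index=test.index, name="Predictions")
    return pd.concat([test["Target"], preds], axis=1)


# Synthetic demo data (noise, not real prices).
rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame(
    {
        "Close_Ratio_2": rng.normal(1, 0.01, n),
        "Trend_2": rng.integers(0, 3, n).astype(float),
    },
    index=pd.bdate_range("1993-01-04", periods=n),
)
data["Target"] = rng.integers(0, 2, n)

train, test = data.iloc[:200], data.iloc[200:]
predictors = ["Close_Ratio_2", "Trend_2"]
model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

default_cut = predict(train, test, predictors, model, threshold=0.5)
strict_cut = predict(train, test, predictors, model, threshold=0.6)
```

Raising the cutoff can only keep or shrink the set of days flagged as "up", which is the trade of fewer trades for (hopefully) better precision that the walkthrough describes.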

You may notice that we have stopped using the Close, Open, High, Low, and Volume columns. The reason is that those are just absolute numbers, so they are not very informative to the model. If the price today is $465, that doesn't tell me anything about whether the price will go up or down tomorrow. The ratios are the most informative part: what is today's price compared to yesterday's price, or compared to last week's price? That's why we removed those columns. Once the backtest is done, we can take a look at the value counts for the predictions again. You'll remember that last time there were about 3,000 days where we predicted the price would go down and about 2,000 days where we predicted the price would go up. This time, the distribution is very different.

Now you can see that there were only a few days on which we predicted the price would go up, and that's because we changed the threshold. We asked the model to be more confident in its predictions before it actually predicted that the price would go up. This means that we will buy stock on fewer days, but hopefully, and we are about to find out, we will be more accurate on those days. So let's check the precision score, passing in our target and then our predictions. We can see that when the model predicts the price will go up, it actually goes up 57 percent of the time. This may not look very good, right? 57 is a failing grade in most places. But it's actually pretty good, especially given that we're only looking at time series data and only at historical prices of the index.

This would actually have made you money if you had traded it from 1993 to the present. Would I recommend using this model for trading? No. There are things you can add to it to make it more accurate, and I'll talk about those, but it's actually a pretty good result given the data we had to work with. It's also better than our baseline: the index goes up on about 53 percent of days, but on the days our model says to buy, the price actually goes up 57 percent of the time. So the model does have some predictive value.

Well, let me summarize and then talk about things that you could do to extend this model. We did a lot of things. We downloaded stock data for the S&P 500 index, cleaned and visualized the data, set up our machine learning target, and trained our initial model. Then we evaluated the error and created a way to backtest and accurately measure that error over long periods of time. Then we improved our model with some extra predictor columns.

If you want to keep extending this model, I would recommend looking at a few things. There are exchanges that are open overnight. The S&P 500 only trades during US market hours, but there are other indices around the world that open before the US markets do. It might be worth looking at those prices and seeing if you can correlate them: if an index on the other side of the world is rising, does that help predict the S&P 500?

You can add in news, such as articles coming out about general macroeconomic conditions like interest rates and inflation. You can also think about adding key components of the S&P 500, such as key stocks and key sectors. For example, if tech is in a downturn, it's possible that the S&P 500 will go down six months later; it may not go down immediately. So that's another thing you can try. You can also try increasing the resolution. We're looking at daily data here, but you could try looking at hourly data, minute-by-minute data, or even tick data if you can get it. It's not always the easiest or cheapest data to get, but if you can get it, you can make more accurate predictions. Those are just some ideas on where you can take this. I know from personal experience that you can build quite a bit on top of this model and get pretty far if you want. I hope you enjoyed this overview of how to build a machine learning model to predict the S&P 500.
