
Solving real world data science tasks with Python Pandas!

Jun 07, 2021
Hello, what's up everyone? Welcome back to another video. I'm very excited about this one: we are going to solve real-world data science problems with Python pandas. Here's how it's going to go: we'll take a dataset, do some initial processing and cleaning of that data, and then, once we have a more polished dataset, we'll start exploring it, really using Python pandas and matplotlib to extract meaning from that data. Ultimately, with that analysis, we'll be able to answer real-world business questions from data the way a data scientist or analyst would.
Before starting, I want to thank you all for making this happen. I posted a poll with four video options, and this one was overwhelmingly the most voted. Then I asked what I should include in the video, and you gave me all kinds of really great answers; actually, too many. I'm not going to be able to fit everything you suggested into this one video, so I'm going to make this a series. If you find that you like this video, be sure to subscribe so you don't miss future entries in the series. Some of my ideas for future real-world data science problems are sports analysis, maybe stock trading analysis, maybe some specific challenges that you find interesting. I think it's going to be a fun way to do a lot of different videos and a lot of different real-world data analysis, so I hope you like this one.

To begin this tutorial, you will need to download the data. To do so, go to my GitHub page, which is linked in the description; the data is 12 months of sales data in the Sales_Data folder of the sales analysis project. Go back to the repository's home directory on GitHub; there are two options to get the data. If you're familiar with git, you can fork this repository and then clone it locally (there are instructions on how to do that in the repository). The other option is to click the green button up here, download the zip, and then extract it wherever you want to work on the code. Once that's done, I recommend the following.
Personally, I like to use Jupyter notebooks to do my analysis, so I'll open a Jupyter notebook. If you don't have Jupyter set up, I'll put a link in the description on how to do that, so check it out. I like Jupyter notebooks because you can write your code and actually display it as a polished analysis that you can hand to someone else, all in the same place without any extra effort. Okay, so navigate to the sales analysis folder; as you can see, there is already a finished Jupyter notebook with the complete code.
We're going to create a new Jupyter notebook to start from scratch, and I'm just going to call it "analysis". Throughout this video I'll present you with concrete tasks for you to try to solve. The way you'll probably get the most out of this video is: every time I present a new task, try to solve it on your own, and if you can't figure it out, go ahead, play the video, and I'll explain the solution. You don't have to do that, but I think it's a good recommendation to get some real hands-on experience.
Okay, to start, let's import the libraries we need; at first that will just be pandas, and if you don't have pandas installed I will put a link in the description on how to install it. So what is our first task? In the repository we have all the data I showed you: the Sales_Data folder containing 12 months of data. The first task will be to merge the 12 months of sales data into a single file, so go ahead and try this on your own, and then I'll explain how to solve it. We want to do this because it will be much easier to do all kinds of annual analysis when we have everything merged, as opposed to spread across 12 separate files.
To start simple, let's first just try to read a single month of data. If you remember, we're currently working in an analysis file, our data is in Sales_Data, and say we want April 2019: we can do df = pd.read_csv("./Sales_Data/Sales_April_2019.csv"). This should give us just the one month of data. If you want to see it (I ran the cell with Shift+Enter, by the way), df.head() will give us the first five rows, and as you can see we have products, order IDs, order dates, purchase addresses: we have our data. So now, how do we take each month and merge them into one? Every time I'm solving something like this I usually do a lot of Google searches; I want to say that Stack Overflow is our biggest friend when we solve these data science problems.
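As a sketch of that first read, here's the idea with a tiny in-memory stand-in for the CSV; the real call is pd.read_csv("./Sales_Data/Sales_April_2019.csv"), and the sample rows here are assumed examples matching the column names described in the video:

```python
import io
import pandas as pd

# Tiny stand-in for one month of sales data. In the notebook this would be
# a file on disk; the column names match those discussed in the video.
april_csv = io.StringIO(
    "Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address\n"
    '176558,USB-C Charging Cable,2,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"\n'
    '176559,Bose SoundSport Headphones,1,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"\n'
)

# Real notebook version: df = pd.read_csv("./Sales_Data/Sales_April_2019.csv")
df = pd.read_csv(april_csv)
print(df.head())  # first five rows of the month's data
```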
I don't remember half of the things I end up using, but I do know what I'm looking for and can search for it. So you'd open a Google window and type something like "read all the files in a directory Python" or "how do I list all the files in a directory". The top result says os.listdir will give you everything that is in a directory, and gives an example. Great, so we can use listdir. I'm going to import os up with our other imports and re-run that cell. At that point, we'll build the file list with a list comprehension: files = [file for file in os.listdir("./Sales_Data")], passing the path to listdir, and hopefully files gives us everything we're looking for.
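A minimal sketch of the directory listing, using a temporary folder as a stand-in for the repo's Sales_Data directory (the filenames here are assumed examples):

```python
import os
import tempfile

# Temporary directory standing in for "./Sales_Data".
data_dir = tempfile.mkdtemp()
for name in ("Sales_April_2019.csv", "Sales_May_2019.csv", "notes.txt"):
    open(os.path.join(data_dir, name), "w").close()

# List comprehension over the directory listing; keeping only CSVs guards
# against stray non-data files sitting in the folder.
files = [f for f in os.listdir(data_dir) if f.endswith(".csv")]
print(sorted(files))
```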
Let me just print each file and see what happens; I'll get rid of the df.head() here. Look at that: we're getting twelve months' worth of files by iterating file by file over os.listdir. So now we just need to take these files and concatenate them into a single CSV. How can we do that? I'm doing the same thing again behind the scenes: searching something like "concatenate data frames pandas", or maybe "merge". pd.concat comes up; okay, this seems useful, so I'll look at some documentation, and now we're going to concatenate them all into a single CSV. Before we get into the for loop, we'll define an empty data frame to store all of our data: all_months_data = pd.DataFrame(). Then, inside the loop, for each file we read it in, remembering to prepend the directory path: df = pd.read_csv("./Sales_Data/" + file), which gives us the appropriate file each time. That reads all the monthly data frames, but we still need to append them to all_months_data. There can be several ways to do this; one is all_months_data = pd.concat([all_months_data, df]), which concatenates the previous months' data with the current file's data. When we get to the end, it should have concatenated every month, because we iterate over each file, and we can do all_months_data.head() to see whether anything works.
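The merge loop can be sketched like this; two tiny hand-built frames stand in for the 12 monthly CSVs, and the comment shows where the real pd.read_csv call would go:

```python
import pandas as pd

# Hand-built stand-ins for two monthly files.
april = pd.DataFrame({"Order ID": [1, 2], "Product": ["Cable", "Phone"]})
may = pd.DataFrame({"Order ID": [3], "Product": ["Monitor"]})

# Empty frame to accumulate into, then concat each month onto it.
all_months_data = pd.DataFrame()
for df in (april, may):  # real loop: df = pd.read_csv("./Sales_Data/" + file)
    all_months_data = pd.concat([all_months_data, df])

print(len(all_months_data))
```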
It ran, so now the question is: does all_months_data have everything, and can we save it? I'm going to write it out to a CSV file. If you're not familiar with some of these commands, I definitely recommend watching the first video I posted on Python pandas (I'll show it above me); it covers all of these things. I'm going to call the file "all_data.csv" and pass index=False because I don't want to save the values of that first index column: all_months_data.to_csv("all_data.csv", index=False). That executed, and now we have all of our data. Moment of truth: I'll go back one directory and open all_data.csv. Okay, that looks good; we have April, it seems, but if I scroll down, do we have the other months? Why is this loading so slowly? I see August; yeah, we have the other months here. You can scroll to double-check, but it looks like we have everything. So this is a good starting point, and I recommend quickly adding a cell to read in the updated data frame, so you don't have to re-run the merge every time you want that all-data file. We can do that with all_data = pd.read_csv("all_data.csv"), and all_data.head() will show us the first five rows. As you can see, we have everything in this all_data frame; that's the updated data frame.
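The save-and-reload step, sketched against a temporary file rather than the real all_data.csv:

```python
import os
import tempfile
import pandas as pd

# Stand-in for the merged 12-month frame.
all_months_data = pd.DataFrame({"Order ID": [1, 2, 3], "Month": [4, 4, 5]})

out_path = os.path.join(tempfile.mkdtemp(), "all_data.csv")

# index=False drops the meaningless row-number column on save.
all_months_data.to_csv(out_path, index=False)

# Later cells can start from the saved file instead of re-merging.
all_data = pd.read_csv(out_path)
print(all_data.head())
```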
Now that we have all the data in one place in a single data frame, let's start doing some analysis. The first question I want you to answer is: what was the best month for sales, based on all the data, and how much money was earned in that month? Feel free to try to solve this entire question on your own, but I'd like to break it into a couple of smaller tasks, so let's insert a few more cells. Before we find the best month for sales, I recommend we augment the data with some additional columns, because they will ultimately be useful in our analysis. Here we have an order date, but we don't easily have the month, so I think we want to add a specific column for the month. I'm going to call this little section "Augment data with additional columns" and add a couple of extra cells, because I want a little space to work with. So task 2 is: add a month column. With this, as with anything in Python pandas and Python in general, there are several ways to do it; it's a balance between what's easiest, what's easiest to read, what's most scalable, and so on. Looking at the data, we don't currently have the month, but we do have a date for each product sold, so what immediately comes to mind as the simplest solution is to take the first two characters of the date string and turn that into the month column. That's what we'll start with.
So how do we do that? If you at least have an approach and know what you're trying to do, you can usually use Google to your advantage and search for the details of that approach. To show that we can easily add a new column, we could start with something like all_data['Month'] = 3: as you see, the month is 3 everywhere now, so it's easy to add a number. But we want the proper month read from the Order Date column. For that we take all_data['Order Date'], and since we want to treat this whole column as strings we can access its string properties with .str, which lets us slice it just like a normal Python string: .str[0:2] takes the first two characters. Honestly, I think that should work; let's look at all_data... look at that, we have the fourth month everywhere. The only thing I have a problem with right now is that it's definitely a string, and I think it should probably be a numeric value. To do that we can do the conversion pretty easily: I'm going to use the astype method in pandas, taking the same column we just saved and doing a little manipulation on it: all_data['Month'] = all_data['Month'].astype('int32'). It doesn't really matter which integer type we use, because the months are just 1 to 12. Let's see if it works... no, it doesn't. Why not? "Cannot convert float NaN to integer". Fun, fun. So it seems we have some NaN rows in our data, and we're going to have to clean them up. Before we finish task 2, let's start cleaning our data: we'll create a new section in our Jupyter notebook and add a couple more cells.
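The month-column step can be sketched like this, on a couple of clean sample rows (real data first needs the NaN cleanup; the sample dates are assumed examples of the MM/DD/YY format):

```python
import pandas as pd

# Sample rows; the real frame comes from all_data.csv.
all_data = pd.DataFrame({"Order Date": ["04/19/19 08:46", "12/01/19 14:30"]})

# First two characters of the date string are the month...
all_data["Month"] = all_data["Order Date"].str[0:2]
# ...then cast the string column to an integer type.
all_data["Month"] = all_data["Month"].astype("int32")

print(all_data["Month"].tolist())
```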
I don't know how many cells we'll need, but I'll make the heading of this section "Clean up the data!". The first thing we want to clean up is those rows that contain a NaN value somewhere, so let's see if we can figure out where they are. We don't see anything in the head, but we could display more rows, maybe 50, and look... I still don't see a NaN. Maybe 100? Okay, we can't spot it this way, and sometimes you have only one or two NaNs in tens of thousands of rows, so even though you can't see it right away, you could maybe sort the data in Excel or something and look; but we can also use pandas to figure out exactly how many NaN rows we have and what they look like. I'll add a quick comment: drop rows of NaN. First let's find where our NaN rows are; I'm going to build something I'll call nan_df and use it to filter down to all the NaN rows we have. I'd guess the NaN is in the Order Date column, since that's what we were converting for the month, but maybe it's in other places too. So how do we get the NaN rows? We can do a handy Google search, something like "find rows with NaN pandas" or "how do you select rows with one or more nulls from a pandas DataFrame". Stack Overflow answers immediately: to show the rows with one or more NaN values in a pandas data frame, you can use df[df.isnull().any(axis=1)]. That sounds good to me; let's try it. We wanted to call it nan_df, and our data frame isn't literally called df, it's called all_data, so we replace that: nan_df = all_data[all_data.isnull().any(axis=1)]. Hopefully this Stack Overflow post helps us; displaying nan_df, we can see what our NaN rows look like, and each one is an entire row of NaNs. So it's not that we need to fill in a single missing order-date value; we have rows that are completely blank. Let's drop all of these rows. To do that we can use a command called dropna, and just as I found isnull, you can Google how dropna works: all_data = all_data.dropna(how='all'). If you look at the documentation for dropna, there's a parameter called how: 'any' drops a row if it contains even a single NaN, while 'all' drops only rows where every value is NaN. In our case the bad rows are entirely NaN, so instead of passing 'any' we'll pass 'all'. Let's see what happens... it looks good, but our real test is re-running the conversion of the Month column to integers. Will it give us the same "cannot convert float NaN to integer" error? This is the moment of truth... yes, we still get an error, but please don't be the same error... look at that, a different error: "invalid literal for int() with base 10: 'Or'". So now we need to figure out what else we need to clean to fix this problem.
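The NaN inspection and drop can be sketched like this, with a hand-made frame containing one fully blank row:

```python
import pandas as pd

# Sample frame with one completely blank row, mimicking the bad rows
# found in the merged sales data.
all_data = pd.DataFrame({
    "Order ID": ["176558", None, "176559"],
    "Order Date": ["04/19/19 08:46", None, "04/07/19 22:30"],
})

# Rows with at least one NaN anywhere.
nan_df = all_data[all_data.isnull().any(axis=1)]
print(len(nan_df))

# how="all" drops only rows where EVERY value is NaN.
all_data = all_data.dropna(how="all")
print(len(all_data))
```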
Okay, we have this problem, so let's figure out what's causing it when we try to convert to int, back in our cleanup section. First, think about what would cause that error. It's not obvious, but ultimately we're taking the first two characters of the order date and casting them to int, so my instinct is that, for some reason, the letters "Or" are the first two characters in some rows. Let's find where that happens. We can reuse the same .str trick we used to take the first two string characters, this time to filter all_data: temp_df = all_data[all_data['Order Date'].str[0:2] == 'Or'] (I think .loc would also work here, but basically we're indexing all the data on a condition). The condition we want is that the first two characters of the order date equal 'Or', because that's what's causing our problem and that's what we want to clean up. I just want to see where it's happening, and then we can eliminate it. Look at this: now we can see our problem. For whatever reason, the column headers are duplicated as rows throughout our data frame, so ultimately, if we can delete all of these header rows, hopefully we won't have errors after that. We filtered all the data for rows equal to 'Or'; to instead keep the rows that are not equal, we can just change == to !=, and instead of a temp_df we assign the result back to all_data. Now, if we're lucky, all_data has the NaNs removed, and we've also gotten rid of the duplicate column headers that were scattered throughout the data frame.
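The header-row cleanup, sketched with one stray repeated header in a sample column:

```python
import pandas as pd

# Sample column with a duplicated header ("Order Date") sitting in the data.
all_data = pd.DataFrame({
    "Order Date": ["04/19/19 08:46", "Order Date", "04/07/19 22:30"],
})

# Keep only rows whose first two characters are NOT "Or",
# i.e. drop the stray repeated header rows.
all_data = all_data[all_data["Order Date"].str[0:2] != "Or"]
print(len(all_data))
```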
Hopefully now we can add our month column and convert it to an int without any problems... and we did! Okay, now that we've added the month column, let's revisit the question we were trying to answer: what was the best month for sales, and how much was earned that month? We now have something that lets us easily group by month, but the other big gap I see in answering this question is sales: while we have the quantity ordered and the price of each item, ultimately we'll want to multiply the two to get a sales value per order. So let's add another column; this will be task 3: add a sales column. Ultimately this will help us answer question 1. How do we do this? It should be simple: as I said, quantity ordered times price each gives us our sales, so we can just write all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']; pandas has a nice syntax where we can simply use the multiplication symbol between two columns. But if we then print our data frame... "can't multiply sequence by non-int of type 'str'". So it looks like we actually have a little more cleaning to do.
The reason we have strings is that, even though these columns look like numbers, they're actually encoded as strings. So, as an added task, let's convert columns to the correct type. We'll start with the two we were just trying to multiply: we want Quantity Ordered to be an int, and Price Each, as we see here, should be a float. Earlier, when we converted the month to an integer, we used astype (where did I put that? Yeah, I forgot where). There's actually another way to produce numeric values in pandas, called pd.to_numeric, and it handles a bit more for you: you don't have to be fine-grained and say you want int32; it just figures out the correct type to convert to. So we're going to use that here to convert our columns to the correct type. I'll keep reiterating this, but when you do this kind of analysis, as long as you have an idea of what you want to do with the data (here: convert these columns to a numeric type), you can use Google to figure out the exact way to do it, and you'll usually get lucky. It's really a question of having the logical thinking to know what you want to do with the data, because finding the syntax is usually pretty accessible on the good old Internet. So: all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered']); I think this should be the syntax we want, and we'll do the same with Price Each. Okay, let's run that... no errors. Printing the head, it looks the same as before, but ultimately we'll know it worked if we can create that sales column without problems. Let's see... there's the Sales column, and notice that 23.90 is 2 times 11.95, so it looks like we have what we want. If you wanted, you could reorder this column to sit right next to Quantity Ordered and Price Each; I'm not going to explain how to do that here, but I covered it in my previous pandas video, so you can check that out.
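The type conversion and sales column, sketched on two sample rows (values chosen to reproduce the 2 × 11.95 = 23.90 check from the video):

```python
import pandas as pd

# Numeric-looking columns that are actually stored as strings.
all_data = pd.DataFrame({
    "Quantity Ordered": ["2", "1"],
    "Price Each": ["11.95", "99.99"],
})

# pd.to_numeric picks an appropriate int/float dtype automatically.
all_data["Quantity Ordered"] = pd.to_numeric(all_data["Quantity Ordered"])
all_data["Price Each"] = pd.to_numeric(all_data["Price Each"])

# Now the element-wise multiplication works.
all_data["Sales"] = all_data["Quantity Ordered"] * all_data["Price Each"]
print(all_data["Sales"].tolist())
```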
Ultimately, it doesn't matter where our columns are located when we perform the analysis, so I'm fine with leaving Sales here on the right side. Now that we've successfully added both our month column and our sales column, let's finally answer the question: what was the best month for sales, and how much was earned that month? We can do this pretty easily with a groupby: we want to find the best month, so we group all_data by the Month column, and then we want to aggregate the values, so we follow the groupby with a sum: all_data.groupby('Month').sum(). If we do that, we have 12 months and we get the summed values for each, and if we just wanted to see the sales we could select the Sales column. Now we can answer our question: as you can see here, December was the best month for sales, with approximately 4.61 million dollars in sales that month, and the worst month was January, with about 1.8 million dollars in sales. If we wanted, we could also plot this: plotting is often a good way to visualize results, see the month-by-month trends, and maybe do a more detailed analysis of why certain months are higher than others; it's easier to see that in a graph. So we can import matplotlib.pyplot as plt. I'll go through this plot a little quickly, but if you want to learn more about matplotlib, feel free to watch the two videos I posted on the library.
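The groupby-and-plot answer can be sketched like this; a tiny frame with two months stands in for the full year, and the Agg backend just lets the plot render off-screen (in a notebook you'd call plt.show() instead):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook
import matplotlib.pyplot as plt

# Sample frame standing in for the cleaned, augmented all_data.
all_data = pd.DataFrame({
    "Month": [1, 1, 12, 12],
    "Sales": [100.0, 50.0, 400.0, 300.0],
})

# Group by month and sum to get total sales per month.
results = all_data.groupby("Month").sum()
print(results)

# Bar chart of sales per month (full data would have months 1..12).
months = results.index
plt.bar(months, results["Sales"])
plt.xticks(months)
plt.xlabel("Month number")
plt.ylabel("Sales in USD ($)")
```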
They should walk you through everything. Here we want to make a bar chart; I think it will be good for this data. Our x values will be the months; I could produce the months in a couple of ways, but I'll just do months = range(1, 13). The 13 is exclusive, so it actually gives me 1 to 12, which is what we want. So I pass months as my x values (and as the x ticks), the summed sales as my y values, and we get our monthly sales bar chart. The next question, answered the same way, is which city had the best sales, and for that chart, what are our x values? They'll be the unique city values, and there's a handy command for that: .unique(). I think that gives us the distinct values in a specific column, so all_data['City'].unique() should give us all the different cities we have, which is what you see here. I can pass this as the x values and as the x ticks too, and let's see what happens. We have a plot... ah, it's really ugly. Having the x tick labels listed horizontally is what's causing the collisions, and the x axis is no longer the month number, it's the city name (with the US state). What we can do with the x ticks is pass rotation='vertical', and maybe make the font a little smaller, say size=8, and now we can see all our cities. But I see a problem: what are we doing wrong?
When we did the groupby sum, we found that San Francisco had the best sales number; however, in the graph we just made, Austin has the best sales. So what's going on? I was going over this code before making part of this tutorial and ran into this, wondering: what the heck, why the inconsistency? The problem is that when we plot the y data, the order matters, but all_data['City'].unique() returns the cities in a different order than the groupby results, so our x data and our y data are not in the same order. Ultimately we need to make these match: the cities must be in the same order as the sales results. To do that we can use a list comprehension. This is something I found on Stack Overflow: you can iterate city by city over the groupby object itself; you basically get two different values per iteration, the key and the group (I realized this just by looking at the Stack Overflow post). So cities = [city for city, df in all_data.groupby('City')] gives me the cities in the same order as they appeared in the groupby sum, and hopefully this fixes the plot. Look at that, yeah: by changing the way we order our keys, the chart now correctly shows San Francisco on top. As a data scientist, you might now ask yourself a few questions.
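The key-ordering fix, sketched with a couple of sample cities; the point is that the x keys come from the groupby itself rather than from .unique(), so they line up with the summed y values:

```python
import pandas as pd

# Sample frame; real data would have a City column parsed from the address.
all_data = pd.DataFrame({
    "City": ["Austin (TX)", "San Francisco (CA)", "Austin (TX)"],
    "Sales": [10.0, 99.0, 5.0],
})

results = all_data.groupby("City").sum()

# Iterating the groupby yields (key, group) pairs in the SAME order as
# the results above; all_data["City"].unique() could give a different order.
cities = [city for city, _ in all_data.groupby("City")]
print(cities)
```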
Why is San Francisco so much higher than the other places? Well, this is an electronics store, so maybe it's because Silicon Valley needs more electronics; maybe the advertising works better in San Francisco; maybe people there have more money. You can start forming hypotheses about why a value is what it is, and the same for the lowest value: why is Portland, Maine the lowest? Of all these cities, it's the smallest, and maybe the advertising is bad there. You can use this graph to help you understand your data and provide key information to business owners that will hopefully improve sales in the future. All right, the next question we're going to answer is quite business-oriented: what time should we display advertisements to maximize the likelihood of a customer buying a product? If we remember what our data looks like, I'd say that to use our data to answer this question, what we really need to look at is the Order Date column. Basically, we need a way to aggregate all the order dates into a distribution over a 24-hour period. So how can we do that?
The first option would be to parse the order date as a string, just like we handled the month column, and grab the hour digits, which could potentially work. But I'm a little worried that if the format of this order date ever changes, grabbing the hour by position wouldn't stay correct. So instead we'll convert this order date into a datetime object: the datetime handling in Python and pandas lets us really easily access the different parts of a date (the hour, the minute) in a very Pythonic way, so it's going to be a lot less hacky than manually parsing the string and hoping the format stays exactly the same.
Even if the date format changes a bit, the datetime parser is a pretty clever tool and can fill in the blanks to figure out exactly which part of the date means what. So we want all_data['Order Date'] converted to a datetime object; right now it's a string. To convert it we can use pd.to_datetime: just as earlier in the tutorial we used pd.to_numeric to convert a column of strings to a column of ints and floats, this converts the column to a datetime column. The one potential caveat is that the datetime representation will probably take more memory than the string currently stored here, but in most cases, as long as your data isn't huge, the performance won't be terrible, so I think it's worth doing. Let's run that; it takes a bit of time because there's a fair amount of computation, but once you've done it you won't have to do it again, so subsequent runs will be quick. Displaying the data again, you can see the date now has a slightly different format. If I'd really wanted to keep the original formatting, I could have duplicated the column as an "Order Date" datetime version alongside the original, but I'm okay with replacing it as we did. The cool thing is that if I want to add the hour (because ultimately we'll want to group by the hour, or maybe the hour and minute), adding an hour column is really simple now that we've converted the column to datetime format.
take that order date column and now I have access to do this dot date syntax and then once I do dot date, I can do dot and see this so we have like the eighth hour here, hour 22 here, the 14th hour here and now. when I run this line and I'll just show the data after we see we get 8 here 22 here 14 here it just knows which part of this date is the hour so that's cool and we can also increase this so it's also do the minute by doing all the data order date point date point minute and you know these kinds of things.
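A rough sketch of those datetime steps (the timestamps here are made-up stand-ins for the real sales data; in the actual dataset 'Order Date' is one column among several):

```python
import pandas as pd

# Toy stand-in for the sales data; only the 'Order Date' column matters here
all_data = pd.DataFrame({
    "Order Date": ["04/19/19 08:46", "04/07/19 22:30", "04/12/19 14:38"],
})

# pd.to_datetime infers the format and replaces the strings with Timestamps
all_data["Order Date"] = pd.to_datetime(all_data["Order Date"])

# Once the column is datetime, the .dt accessor exposes each component
all_data["Hour"] = all_data["Order Date"].dt.hour
all_data["Minute"] = all_data["Order Date"].dt.minute

print(all_data[["Hour", "Minute"]])
```

Note that the conversion only has to run once; after that, the Hour and Minute columns are plain integers you can group on.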
I keep repeating this throughout the tutorial: you know, I'm not always remembering these commands right away, I'm just Googling for intelligent ways to work with dates, and that leads me maybe to the documentation, which leads me to this, or maybe to a Stack Overflow post, which leads me to this. So look at this, now we have all_data['Order Date'].dt.minute to grab the minute. Oh, I should probably have an equals sign here. Cool, and look at that, so 8:46 becomes 46. Now that we have these two columns we can go ahead and determine what the best time is, and this is basically a repetition of what we did in questions 1 and 2: let's do a groupby and then plot some things similar to how we did before.
I want to use the list comprehension format from earlier to grab our keys. To start off, I'll say hours = [hour for hour, df in all_data.groupby('Hour')], taking the keys of the Hour column groups, so these hours are what we'll use as the x-values in our chart. And this time, instead of the bar charts we used in the graphs above, I'm going to make this look a little more continuous and use a line plot. We can do this as long as you've loaded matplotlib somewhere in this tutorial: now you can do plt.plot, we have our hours, and then what is our y data? Well, our y data will be the result of grouping all_data by the Hour column, and we don't need to sum the different values in these columns; in this case we can just do a count to see what's going on there. Great, that looks good. So we grouped all_data by hour and then just counted the number of rows for each hour. It might be helpful to quickly print this out, and you'll see that all of these counts are the number of order occurrences for that specific hour. So if we plot it, we get this nice graph. And then what does this graph tell us?
Well, maybe it would be helpful to add some ticks, so let's add a couple real quick. What do we want? Maybe we'll make the xticks the hours, just so we can see every hour easily. So what's going on there? We have all these ticks now. One thing that's still difficult is that I'm having a hard time visually seeing what each tick on this graph corresponds to, so another thing I can add to this graph is the grid, which is a nice little feature that makes it much easier to see which time is probably best. So let's look at this: we have peaks around 11 a.m., or 11:00, however you want to say it, and then another peak around 19, so around 7 p.m. So 11 a.m. and 7 p.m.; that makes sense. Early in the day maybe you're doing your browsing, and 7 p.m. makes sense too, maybe it's an after-work type of deal. So those are the peak times that people order. If I had to answer this question, which was at what time should we display advertisements to maximize the likelihood of customers buying, I would say this graph shows us that maybe just before 11 a.m. is a good time to put out an ad, or 12 is a good time, or 6 or 7 p.m., 18 or 19 here, would be good times to show ads, because those were the peaks where all the orders in our data set occurred. And we could label this graph the same way we did the others if we wanted.
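The hourly grouping and plot described above can be sketched roughly like this (the rows here are invented; with the real data each row is one product line in an order):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Toy data: the Hour column is assumed to have been added already
# via all_data['Order Date'].dt.hour
all_data = pd.DataFrame({
    "Hour": [8, 11, 11, 19, 19, 19, 22],
    "Order ID": [1, 2, 3, 4, 5, 6, 7],
})

hours = [hour for hour, df in all_data.groupby("Hour")]   # keys for the x-axis
counts = all_data.groupby("Hour").count()["Order ID"]     # rows per hour

plt.plot(hours, counts)
plt.xticks(hours)   # one tick per hour so the peaks are easy to read off
plt.grid()          # grid lines make the peak hours obvious
plt.xlabel("Hour")
plt.ylabel("Number of Orders")
plt.savefig("orders_by_hour.png")
```

With the real data, the peaks in this line plot are what point at 11:00 and 19:00 as the best ad times.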
Great, that sounds pretty good to me, and this is something you could present to a business person and say, look at this data, this is from last year, let's target our ads accordingly. And maybe break it down a little more: this graph was for all cities, so you could make specific graphs for certain cities; it's all worth playing with. The next question is which products are most often sold together, so feel free to pause the video and try to solve this on your own. I'll note that I thought this was a particularly difficult question, so congratulations if you can figure it out without seeing my solution. So how would you go about solving this?
I guess the first thing you should do is frame the problem: how do we know if a product is sold together with another product? If you look at these order ID values, we can see one right here: if two rows have the same order ID, that means this Google Phone and these wired headphones were ordered together, because they have the same order ID, and as you can see, they are also delivered to the same address. So we are basically trying to find out, by looking at all the duplicate order IDs, which products sold together the most. And yes, like I said, I didn't think this was particularly easy.
I thought it was quite challenging, so don't worry if you can't get it on your own, because I was looking at Stack Overflow to figure it out myself. So how would I do it? The first thing to note is that we want to find all the rows in our data frame that have duplicate order IDs, because those are the only rows we have to worry about, and ultimately they will let us find which products were sold together. So let's create a new data frame for this: we'll take all_data and filter it by all_data['Order ID'].duplicated(), which checks all the cells in the Order ID column and sees which ones are duplicated, and then we'll pass keep=False. Now I'll show what this data frame looks like, and we'll display, say, 20 values here. As you can see, this and this have the same order ID, and this and this have the same order ID, and sometimes you'll see three values, because it's possible three items were all ordered together. So it just reduced that big data frame to only the duplicates, and we can work with this data frame now to do the rest; in the final solution Jupyter Notebook that I reference, I have the links that I used here. Another useful trick: if you understand this command but don't understand why I passed keep=False here, and you're using a Jupyter Notebook, at the end of the command you can press Shift+Tab and open the docstrings. You'll see this keep parameter, and if I expand it, it's basically whether you should keep the first occurrence of the duplicates, keep the last occurrence, or keep everything, meaning all the duplicates. That's why we passed keep=False: we want to keep all of them.
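A minimal sketch of that filtering step (the order IDs and products here are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the sales data
all_data = pd.DataFrame({
    "Order ID": [176558, 176559, 176560, 176560],
    "Product": ["USB-C Charging Cable", "Macbook Pro Laptop",
                "Google Phone", "Wired Headphones"],
})

# keep=False marks every occurrence of a duplicated ID,
# not just the first or last one
df = all_data[all_data["Order ID"].duplicated(keep=False)]
print(df)
```

Only the two rows sharing order ID 176560 survive the filter, which is exactly the "ordered together" set we want to analyze.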
Now that we have all these duplicates, we need to start grouping them, so I'm going to create a new column in our data frame called 'Grouped', and this is where some weird stuff is going to happen. Basically, I want this new column to have the Google Phone and the wired headphones on the same line; that's the magic I'm about to put on the screen, that's ultimately what I'm trying to get at. Instead of having them on different lines, let's put them all on the same line. To do that, we'll add a column called 'Grouped' and group our filtered df, which holds only the duplicated rows, by the order ID. So we group by 'Order ID', and what we're looking at specifically within each order ID is the 'Product' column, and then we'll use a function called transform. transform is similar to apply in that it takes a lambda x and lets us edit the actual contents of the cells, so for all the products whose order IDs are grouped together, we join them with a comma. Hopefully this works; it may take a little time. It seems to have worked, but read what it says: I'm not totally sure what's happening, it's asking me to use .loc, so maybe I should use .loc somewhere, but everything is working. You see we have 'Google Phone,Wired Headphones' here in the Grouped column, so that looks good. Now let's think: the only problem I have is that we have this same order twice, because we did this for every row in the group, so rows two and three are exactly the same. Let's remove those duplicate occurrences of the same order, the same pairs in the order. To get rid of the duplicates, we can set df equal to df with just the 'Order ID' and 'Grouped' columns, because otherwise some of the other columns might differ, and then drop the duplicates. So what do we get now when we print df? This removed the duplicates, cool, looks good to me. As we saw before, each order now appears only once, so when we're actually counting these pairs, iterating over all the rows in our final df, each pair gets counted once per order. OK, that was a lot. Basically, we're going to iterate over all of these rows and count these pairs now, and use them to get our final count of which products are sold together most often. Okay, let's move on to another cell to do this last part.
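Put together, the transform-and-deduplicate steps look roughly like this (order IDs and products invented for illustration):

```python
import pandas as pd

# Toy duplicated-rows frame, as produced by the keep=False filter
df = pd.DataFrame({
    "Order ID": [176560, 176560, 176574, 176574],
    "Product": ["Google Phone", "Wired Headphones",
                "Google Phone", "USB-C Charging Cable"],
})

# transform keeps the original shape, so every row of an order gets
# the same comma-joined product string...
df["Grouped"] = df.groupby("Order ID")["Product"].transform(lambda x: ",".join(x))

# ...which lists each order once per row; keep the two columns we
# care about and drop the repeats
df = df[["Order ID", "Grouped"]].drop_duplicates()
print(df)
```

After drop_duplicates, each multi-product order contributes exactly one comma-joined line, ready for the pair counting below.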
So we have a data frame that looks like this, and now we need to count which pairs occur together most frequently, and we'll use a couple of new libraries for this. From itertools we're going to import combinations, and from collections we're going to import Counter. As I said before, in these cells, when you look at the final solution for this, I'll have pasted in the Stack Overflow posts that I linked and looked at while I was doing this. But okay, so how do we do this? Basically, I looked at a Stack Overflow post so I could reference it and see how we can use it to help us, so let's look at it. It has a comma-separated list of values, which we can think of like our 'Grouped' column: we could probably turn that column into a format like that. Then it has a list of lists, and it updates a Counter based on the combinations in the sublists. So if we can duplicate that format, we can do the same thing and ultimately get a dictionary of pair counts like theirs. Let's do it real quick. The first part of that answer was to have a counter, so we make count = Counter(), and then it iterates with 'for sublist in list', so we want to use each grouped row as our sublist. To do that, we can write 'for row in df['Grouped']', and this will give us each of these entries.
Now we want to get a list, because in that original Stack Overflow post they have a sublist, so inside the loop we can split the row on the comma with row.split(',') and get a sublist, which is pretty good. Then we can copy their format, with our own variable names, and update the counter with the combinations of the row list. We want to count pairs of two; if I wanted to count the most common three-way occurrences, I could pass a 3 here instead. So now we have that counter, and if I print count, I think we should see something. Ah, that's pretty hard to read, but I think it tells us the top thing: the iPhone and the Lightning charging cable were the items most commonly ordered together. If we want a better format, we could do count.most_common(10); this is a method on the Counter object, and yeah, that's a little better, where we can see the pairs in a more readable way. If we want it even more readable, we could do 'for key, value in count.most_common(10)' and then print just 'key, value', and that gives us this, which I guess is a little easier to read. I'm actually just curious, real quick: what would we get as the three most common in a row if we did this with a 3 here? Look at that, so if you want to see which three items sold together most frequently, just put the 3 in there. But we'll keep it as two, and that's great.
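The pair counting can be sketched like this; grouped_rows stands in for the df['Grouped'] column from the deduplicated frame, with invented values:

```python
from itertools import combinations
from collections import Counter

# Stand-in for df['Grouped']: one comma-joined string per order
grouped_rows = [
    "Google Phone,Wired Headphones",
    "Google Phone,USB-C Charging Cable",
    "iPhone,Lightning Charging Cable",
    "Google Phone,Wired Headphones",
]

count = Counter()
for row in grouped_rows:
    row_list = row.split(",")
    # combinations(..., 2) yields every unordered pair in the order;
    # change 2 to 3 to count triples instead
    count.update(Counter(combinations(row_list, 2)))

for key, value in count.most_common(10):
    print(key, value)
```

most_common(10) sorts the pairs by frequency, so the top line is the pair of products sold together most often.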
What would we do with this? Well, you know, maybe try to be smart with the promotions you offer. If you're selling an iPhone, people buying an iPhone probably already want to buy an extra Lightning charging cable, but maybe make a smart deal where you can attract an even larger audience to buy that Lightning charging cable, and you can repeat that with these other common pairs and try to use it to your advantage. As a business user, this data would help a company make decisions like that.
Well, let's end this video with one last question, and this final question will definitely be a little simpler than the last one: what product sold the most, and why do you think it sold the most? So how would we go about doing this? We can start with the same old groupby stuff. Let's look at our data: really, all we have to do to figure out which products are selling the most is sum the quantity ordered, grouping by product. So we can group all_data by product, and I'll do this slightly differently than I have in the past, just to keep things pretty clean: product_group = all_data.groupby('Product'). If I printed product_group, I don't know if that would help much yet, but if we go ahead and work with the product groups, we can use the quantity ordered to figure out the top items, and we should probably show it as a graph again; that's usually the best play. So the quantity ordered, if we wanted to abstract it to a variable, would be quantity_ordered = product_group.sum()['Quantity Ordered'], and then we can graph it just as we did with the previous ones.
I'm going to make this a bar chart again, so maybe I'll take the earlier code and have the products come from iterating product_group instead of the groupby we used before. Oh, I spelled it wrong, there we go. And now we want to plot: plt.bar, our x will be the products and our y will be the quantity ordered. That looks good. Oh, look at those labels, they're not nice; let's do the rotation that we did in the last examples. I'm just going to paste it in. Jeez, I pasted it from the wrong place. There, look, you have a nicer graph. And if you want to get rid of this annoying output that prints before the graph, and you're using a Jupyter Notebook, just add plt.show(). Look at that. Okay, and now we should probably add labels: the y label would be 'Quantity Ordered'. I guess using the quantity ordered rather than just a count of rows is probably better, since it takes into account the fact that you can order two of the same product in one row. That all comes down to this, and then our x label was 'Product'. Okay, so now that I have this graph, the original question was which product sold the most and why do you think it sold the most. We can see that the AAA batteries were the most popular item at Keith's fancy-schmancy electronics store. Immediately you might have an idea why they might sell the most; let's look at the other best-selling items: the Lightning charging cable, the USB-C charging cable, wired headphones, AA batteries. Why would those outsell an LG dryer and an LG washing machine? Well, the immediate reaction I have looking at this data is that those top items are cheap: AAA batteries are cheap, and an LG dryer is not cheap, so that's probably why they're so much higher than the others. But you know, as a data scientist, it's often good to test our hypotheses.
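The best-sellers chart can be sketched like this; the sample rows are invented, and the column names follow the dataset ('Product', 'Quantity Ordered'):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Toy stand-in for the sales data
all_data = pd.DataFrame({
    "Product": ["AAA Batteries (4-pack)", "AAA Batteries (4-pack)",
                "LG Dryer", "Macbook Pro Laptop"],
    "Quantity Ordered": [2, 3, 1, 1],
})

product_group = all_data.groupby("Product")
quantity_ordered = product_group.sum()["Quantity Ordered"]  # units sold per product
products = [product for product, df in product_group]       # keys for the x-axis

plt.bar(products, quantity_ordered)
plt.xticks(rotation="vertical", size=8)  # rotate the long product names
plt.xlabel("Product")
plt.ylabel("Quantity Ordered")
plt.savefig("quantity_by_product.png")
```

Summing 'Quantity Ordered' rather than counting rows matters because a single order line can contain more than one unit of the same product.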
What we could do is overlay this graph with the actual prices of these items and see if there's that kind of direct correlation. Okay, so let's make a new cell to do this. We'd start by getting the prices, and to do that we can group by product again; in addition to the sum and the count that you've seen in this video, there's also mean, and that's what we want here: the price of each item, which will be a kind of average price that these things are sold for. Okay, so let's print the prices, and look, you have all the prices of these items; that's pretty simple. Okay, now that we have that, we need to figure out how to overlay this data on top of this graph and add a secondary y-axis.
I guess that would be the price here on the right, while keeping the same x-axis. Let's go out and search how to do this: we want to add a second y-axis to a matplotlib plot, and let's see what we get when we query that. Okay, so with "second y-axis matplotlib plot" you get some results here. The top result is from matplotlib.org, and you know matplotlib.org often has good stuff, but I often find the easiest solutions come from the Stack Overflow posts, so I'll take the first one of those and see if I can adapt the answer to what I'm trying to do. The question is "adding a y-axis label to secondary y-axis in matplotlib", and this answer has a lot of upvotes, and I always use the upvotes to judge whether an answer is probably a good one.
A lot of people liked this one. It uses fig, ax1 = plt.subplots(), so we're going to want to use subplots, and I think we can basically copy this code pretty much exactly and then adapt it to what we need, so I'm going to do that. We paste all of this in, and instead of plotting x and y1 like they did, ultimately we'll plot what we've been using: our products on the x-axis, and the quantity ordered as the first y, like we did before. And here theirs is a line, but ours should be a bar, so we make the first plot a bar chart; we can play with that in a second. We set the first y label to 'Quantity Ordered', and the second y label is the price. We could say 'Price in US dollars' if we want, but I'll just label it 'Price'. Okay, let's hope this works.
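Adapted to our data, the twinx() pattern from that answer looks roughly like this; the sample rows and numbers are invented, and the 'Price Each' column name is an assumption about the dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented rows standing in for the real sales data
all_data = pd.DataFrame({
    "Product": ["AAA Batteries (4-pack)", "AAA Batteries (4-pack)", "LG Dryer"],
    "Quantity Ordered": [2, 3, 1],
    "Price Each": [2.99, 2.99, 600.0],
})

product_group = all_data.groupby("Product")
quantity_ordered = product_group.sum()["Quantity Ordered"]
prices = product_group.mean()["Price Each"]  # average sale price per product
products = [product for product, df in product_group]

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

x = range(len(products))
ax1.bar(x, quantity_ordered, color="g")
ax2.plot(x, prices, "b-")

ax1.set_xlabel("Product Name")
ax1.set_ylabel("Quantity Ordered", color="g")
ax2.set_ylabel("Price ($)", color="b")
ax1.set_xticks(x)
ax1.set_xticklabels(products, rotation="vertical", size=8)
fig.savefig("quantity_vs_price.png")
```

Matching each axis label's color to its series (green bars, blue line) is what makes the overlaid chart readable at a glance.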
Look, it's still a little ugly. Here's a little thing I had already written because it was a little tricky: we can set up the xtick labels on the axes object using this format. It's a little different from the plt.xticks we've been using, but that should do the rotation we've been doing. Look at that, cool. And then we should sync up the colors: the quantity ordered has a green color here, so let's also pass green as the bar color. Okay, cool, now we have a chart that is overlaid on another chart: in blue we have the price and in green we have the quantity ordered. What we're trying to test is whether the price and the quantity ordered correlate; our hypothesis is that wherever the quantity ordered is high, the price should be low. As we see, that is the case for the AAA and AA batteries, while when we move to other products, such as the LG dryer, the price is quite high.
Lightning charging cable is so high. The USBC charging cable is so high. Those are prices so low that the quantity ordered eventually skyrockets. up because there are more people willing to pay that price. Well, with that we're going to end the video here. I hope you learned something and also had fun with this tutorial. I know I personally had a lot of fun doing this tutorial curating everything from data to asking business questions to using data to analyze those questions, so I hope you enjoyed that process too. If you enjoyed this video, it means a lot to me if you give me the thumbs up.
I want to make more videos like this in the future, but the only way I'll know to do it is if you guys show interest in this one, so while you're at it, give it a like and subscribe. And if you want to stay up to date on what I'm doing day in and day out, check out my Instagram and Twitter. Until the next video, thanks again guys for watching. Peace out.
