
Principal Component Analysis (PCA) clearly explained (2015)

Jun 03, 2021
StatQuest, StatQuest, StatQuest! Hello and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we're going to talk about principal component analysis, or PCA for short.

Let's start with an example of principal component analysis in action. Here is an example PCA plot I got from an article I was reading. It shows clusters of cell types. This graph was drawn from single-cell RNA sequencing data. About 10,000 genes were transcribed in each cell, and each point on this graph represents a single cell and its transcription profile.
The general idea is that cells with similar transcription profiles should cluster together, and that is what we see in this graph: blood cells form a cluster that is different from pluripotent cells, which are different from neural cells and from dermal or epidermal cells. So the big question is: how do you compress the transcription of 10,000 genes into a single point on a graph? The answer is PCA. PCA is a method of compressing a lot of data into something that captures the essence of the original data. In this StatQuest, we will learn all about how PCA performs this compression and also find out what these axis labels refer to.

Before diving into the nitty-gritty of PCA, let's cover some background material. We're going to have an introduction to dimensions. Just to warn you, this is going to seem very, very simple, but stick with it. You'll be glad you did. It will keep your head from exploding. If you can remember all the way back to first or second grade, you will remember that one dimension is just a number line. Now imagine we had a mock RNA-seq data set for a single cell. Here I have labeled the genes just A, B and C, and the read counts are 10, 0 and 14 for those genes. We can plot these values on the number line just like we did in first or second grade: gene A, with 10 reads, gets a dot at 10. Gene B, with zero reads, gets a dot at zero. And lastly gene C, with 14 reads, gets a dot at 14. If we plotted all the genes, we might see something like this, a uniform distribution of transcript counts, or we might get a non-uniform distribution of transcript counts: some genes might not be transcribed much and would be on the left side of our number line, and some genes might be transcribed a lot and would be on the right side. Even though our number line is a very simple graph, we can get useful information from it.
Now let's fast forward to fifth or sixth grade, when we learned about two-dimensional graphs. We now have two axes instead of just one, and we can now plot data from two different cells instead of just one. Here is a mock RNA-seq data set for two single cells. As before we have the same genes, but now we have read counts from two separate cells. If you can remember from fifth or sixth grade, the way we plot the data for gene A is that we go over 10 for cell one and up eight for cell two, and we put a dot there. For gene B, we go over zero for cell one, so we don't move at all, and we go up two for cell two. And for gene C, we go over 14 and up 10.
If we graph all the genes, we might see something like this. Here we see that expression in the two cells is correlated, meaning that genes that are highly transcribed in cell one are also highly transcribed in cell two, and genes that are lowly transcribed in cell one are also lowly transcribed in cell two. Or we could see that expression in the two cells is uncorrelated, meaning that if a gene is highly transcribed in cell one, that tells us nothing about whether it is highly or lowly transcribed in cell two.
Then, maybe at some point around when we took calculus, we started drawing three-dimensional graphs. A three-dimensional graph is just a fancy graph that has depth, with three separate axes.
We can now plot data from three separate cells, so now our imaginary RNA-seq data set has data for three individual cells. Just like before, if we wanted to plot the data for gene A, we would go over 10 for cell one, up eight for cell two, and then back eight for cell three. We would then draw lines perpendicular to each axis to find where they all meet, and we put a dot there. I'm not going to do too many examples of this because you get the idea.
So this is what we know about dimensions so far. If we have data from one cell, we just need a one-dimensional graph, which is just a number line. If we have data from two cells, then we need a two-dimensional graph, which is just an ordinary x/y graph. If we have data from three cells, we need a three-dimensional graph. And what if we have data from four separate cells?
You guessed it, we need a four-dimensional graph. The problem is that we can't draw it on paper. And if we had data from 200 individual cells, we would need a 200-dimensional graph, and there is no way we can draw that. So the question is: are all of those dimensions super important, or are some more important than others? To answer that question, let's go back to a data set that has just two cells and two dimensions.
Hypothetically speaking, what if we had data from two cells that looked like this? Here we see that almost all of the variation in the data is from left to right. That is, cell one has some genes that are lowly transcribed and some genes that are highly transcribed, but it appears that all the genes in cell two are transcribed at about the same level. If we flattened the data, removing the up-and-down variation, our graph wouldn't look much different than it did before. And once the data are flattened, we could graph them with a single number line.
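As a rough sketch of this flattening idea, the Python snippet below compares how much each cell's axis varies and then keeps only the axis that carries almost all of the variation. The read counts are made-up numbers chosen so that cell two stays nearly constant. They are an assumption for illustration and do not come from the video.

```python
from statistics import variance

# Hypothetical read counts for six genes (assumed values, for illustration only).
# Cell one ranges from low to high; cell two stays at about the same level.
cell_one = [1, 4, 7, 10, 13, 16]
cell_two = [5, 6, 5, 6, 5, 6]

# Sample variance along each axis of the 2-D scatterplot.
var_one = variance(cell_one)  # left-right spread
var_two = variance(cell_two)  # up-down spread

# Share of the total variation that lies along the cell-one axis.
share_one = var_one / (var_one + var_two)
print(f"cell one carries {share_one:.1%} of the variation")

# "Flattening": drop the up-down axis and keep a single number line.
number_line = list(cell_one)
print(number_line)
```

With these assumed counts, nearly all of the variation sits on the cell-one axis, so dropping the cell-two axis loses very little.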
In this case, we can take two-dimensional data and display it in a one-dimensional graph without too much loss of information. Both graphs say that the important variation is left to right.
Here's another example of how some dimensions are more important than others: TV and movies. TV and movies are almost always in 2D. That is, they are shown on flat screens at home or in the cinema, and we don't normally wear fancy 3D glasses when we watch them. Even though the subjects of the film are in 3D, seeing them in 2D is okay. The third dimension usually doesn't add much to the story.
That's why when we spend the extra three or four dollars to see a 3D movie, we're usually disappointed anyway. People look like people and things look like things even when they have no depth and are flattened on a screen. Basically, a movie camera takes 3D information and flattens it to 2D without too much loss of information.
To summarize what we know so far: every cell we sequence adds another dimension, and some dimensions are more important than others. So what does all this have to do with PCA? Well, PCA takes a data set with a lot of dimensions, i.e. a lot of cells, and flattens it to just two or three dimensions so we can look at it.
It tries to find a meaningful way to flatten the data by focusing on the things that are different between cells. We will talk much more about this later. For any biologists out there, this is like flattening a z-stack of microscope images to make a single two-dimensional image for publication.
So let's start with an example. Again, we'll start with two cells. Here's the data. The genes are imaginary, so I just labeled them A through I, and here is a 2D graph of the data from the two cells. Generally speaking, the points are spread out along a diagonal line. Another way to think about this is that the maximum variation in the data is between the two endpoints of this line. Generally speaking, the points are also spread out a little above and a little below the first line we drew.
Another way to think about this is that the second largest amount of variation is between the endpoints of this new line we just drew. If we rotate the entire graph, the two lines we drew form new axes. As we saw before, the data vary a lot left and right and a little bit up and down. Note that all of the points can be drawn in terms of left and right and up and down, just like any other 2D graph. That is, we don't need another line to describe the diagonal variation. We have already captured the two directions we can have variation in with these two lines.
These two new, or rotated, axes that describe the variation in the data are the principal components. Principal component one, or PC1, is the axis that spans the most variation in the data. PC2, or principal component number two, is the axis that spans the second most variation.
So these are the general ideas we have covered so far. For each gene, we plot a point based on how many reads it had in each cell. Principal component one captures the direction of the most variation. Principal component two captures the direction of the second most variation.
What if we had three cells? Like before, principal component one would span the direction of the most variation and principal component two would span the direction of the second most variation. However, since we have another direction we can have variation in, we need another principal component: principal component number three, which spans the direction of the third most variation.
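The transcript does not say how the two lines are actually found. One standard way, used here as an assumption, is to take the eigenvectors of the covariance matrix of the centered data. The sketch below does this in closed form for the two-cell case, using only the three read counts quoted earlier (genes A, B and C), since the values for genes A through I are not given.

```python
import math

# Read counts for genes A, B and C from the two-cell example in the transcript.
cell_one = [10, 0, 14]
cell_two = [8, 2, 10]
n = len(cell_one)

# Center each cell's counts on its mean.
mean_one = sum(cell_one) / n
mean_two = sum(cell_two) / n
dx = [x - mean_one for x in cell_one]
dy = [y - mean_two for y in cell_two]

# Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
a = sum(u * u for u in dx) / (n - 1)
b = sum(u * v for u, v in zip(dx, dy)) / (n - 1)
c = sum(v * v for v in dy) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix:
# the variation captured along PC1 and along PC2.
half_trace = (a + c) / 2
radius = math.hypot((a - c) / 2, b)
var_pc1 = half_trace + radius
var_pc2 = half_trace - radius

# PC1 points along the angle theta; PC2 is perpendicular to it.
theta = 0.5 * math.atan2(2 * b, a - c)
pc1 = (math.cos(theta), math.sin(theta))
pc2 = (-math.sin(theta), math.cos(theta))

print("PC1 direction:", pc1, "variance:", var_pc1)
print("PC2 direction:", pc2, "variance:", var_pc2)
```

Because these three genes are strongly correlated between the two cells, almost all of the variation ends up along PC1, which is the diagonal line described above.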
What if we had four cells? Principal component one would span the direction of the most variation, principal component two would span the direction of the second most variation, principal component three would span the direction of the third most variation and, you guessed it, principal component four would span the direction of the fourth most variation. There is a principal component for each dimension, or each cell, in the data. If we had 200 cells, we would have 200 principal components, and the 200th principal component would span the direction of the 200th most variation.
Hooray! Now that we know what PC1 and PC2 are, you may be wondering: the PCA plot at the start was of cells, not of genes, so how does that work?
So far the only thing we've talked about is how to plot genes. To answer your question, let's go back to the original scatterplot for two cells. For now let's focus on principal component one. The length and direction of PC1 is mainly determined by the circled genes: the genes at the endpoints, or the extreme genes. Now we're just going to move the graph to the left side of the screen so we can put other interesting things on the right side. If we wanted to, we could score the genes based on how much they influence principal component one, and here is a list of qualitative scores we could give to each gene.
Genes near the ends of the line, such as A and F, would get high scores because they greatly influence PC1. Genes in the middle, like B and C, would get low scores. We could also use quantitative scores for each gene, so that genes with little influence on the principal component would get values close to zero and genes with more influence would get numbers further from zero. Genes at opposite ends of the line get similarly large numbers but with opposite signs, so A could get a positive number like positive 10, and F, because it is at the other end of the line, could get a negative number like -14.
Similarly, we could also score the genes on how they influence principal component number two. Now we have two tables of genes and the influence they have on the principal components: one is for principal component one and the other is for principal component number two. Now that we have these two tables for the first two principal components, we can use them to plot cells and not just genes.
We do this by combining the read counts of all the genes in a cell to get a single value. Here's how to do it. First we go back to the original read counts for each cell. We can then calculate a score for cell one by taking the read count of gene A and multiplying it by the influence of gene A on the principal component, adding to that the read count of gene B multiplied by the influence of gene B, and doing that for all of the genes. Here is a concrete example. For gene A in cell one, we have 10 reads, and the influence that gene A has is 10, so the first part of this sum is 10 * 10. The second part of the sum is the read count for gene B, which is zero, multiplied by the influence that gene B has, which is 0.5.
We simply continue multiplying and adding, multiplying and adding, until we have done it for every gene in the cell. For this example, we might end up with a number like 12, which would be our value for PC1.
To calculate a value for principal component two, we do the same as before, except instead of using the weights, or influences, on principal component one, we use the weights, or influences, on principal component number two. So in this case gene A has 10 reads, and we multiply that by three because that is the influence gene A has on principal component number two. We add the read count for gene B multiplied by the influence that gene B has on principal component number two, which in this case is 0 * 10, and we just do that for each gene. Again, we end up with a score for principal component number two, and in this case it could be six.
So we did the calculations for cell one: we have a value for principal component number one and a value for principal component number two. Now all we have to do is plot it on a graph. If we create a graph where the x-axis is principal component one and the y-axis is principal component number two, we can do what we did in fifth grade: we just go over 12 and up 6 and put our dot right there. Now we have to calculate the scores for cell number two. If we do the calculations, multiplying the read counts for each gene by the influence that each gene has on the principal component, we could end up with numbers like two for principal component one and eight for principal component number two.
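The multiply-and-add recipe above can be sketched in a few lines of Python. The read counts and influences for genes A and B are the ones quoted in the walkthrough. Gene C's read count is borrowed from the earlier mock data set and its influences are hypothetical, so with only three genes the result will not reproduce the 12 and 6 mentioned above. Real PCA software also typically centers the data first, a step the walkthrough skips.

```python
# Read counts for cell one. A = 10 and B = 0 are from the worked example;
# C = 14 is borrowed from the earlier mock data set (an assumption).
reads_cell_one = {"A": 10, "B": 0, "C": 14}

# Influence (loading) of each gene on each principal component.
# A and B are the values quoted in the transcript; C is hypothetical.
influence_pc1 = {"A": 10.0, "B": 0.5, "C": -6.0}
influence_pc2 = {"A": 3.0, "B": 10.0, "C": -2.0}


def cell_score(reads, influence):
    """Multiply each gene's read count by its influence and add everything up."""
    return sum(reads[gene] * influence[gene] for gene in reads)


pc1_value = cell_score(reads_cell_one, influence_pc1)  # 10*10 + 0*0.5 + 14*(-6)
pc2_value = cell_score(reads_cell_one, influence_pc2)  # 10*3 + 0*10 + 14*(-2)

# The cell is plotted at (pc1_value, pc2_value) on the PC1/PC2 graph.
print((pc1_value, pc2_value))  # (16.0, 2.0)
```

Running the same function on another cell's read counts gives that cell's coordinates on the same graph.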
Again, we just plot it like we did in fifth grade. If we sequenced a third cell and its transcription was similar to cell one's, it would get scores similar to cell one's, and as a result, when we plot it on the graph, cell number three would be closer to cell number one than to cell number two. Hooray! We finally know how all the cells in this graph were plotted.
These are the general ideas we have covered so far. The genes with the most variation between cells will have the most influence on the principal components. That is, genes that are highly expressed in some cells and not expressed in others will have a lot of variation and a lot of influence on the principal components.
The first principal component captures the most variation in the data. The second principal component captures the second most variation in the data. We can use the original data and the first two principal components to plot the cells.
We can also use the graph to identify key genes. Do you see how the cells are spread out left and right and up and down? If we wanted to know which genes have a strong influence on putting dermal cells on the left side of the graph and neural cells on the right side, we could look at the influence scores on principal component number one. And if we wanted to know which genes help distinguish blood cells from neural and dermal cells, we could look at the influence scores on principal component number two.
But wait, there's even more! There are a couple of diagnostics you can, and should, run if you are drawing your own PCA plot. These are ways you can tell whether your PCA is really worth anything. One diagnostic plot is called a scree plot, where you plot how much variation each principal component accounts for. What you want to see in this diagnostic plot is that most of the variation is explained by the first two principal components.
Finally, here's a terminology alert. The way I've been describing things has been pretty intuitive, but there's actually a lot of technical jargon for principal component analysis. The numbers that describe the importance of each gene to principal component one I have been calling influences or weights, but in PCA terminology those weights are called loadings, and an array of loadings is called an eigenvector. And that is all there is to PCA. Tune in next time for another exciting StatQuest.
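As a closing sketch of the scree-plot diagnostic mentioned above, the snippet below turns the amount of variation captured by each principal component into percentages, which are the bar heights of a scree plot. The five input values are hypothetical and chosen only for illustration. In practice they would come from the PCA itself.

```python
# Hypothetical amounts of variation captured by PC1..PC5 (assumed values).
variation = [45.0, 30.0, 12.0, 8.0, 5.0]

total = sum(variation)

# Percentage of the total variation each principal component accounts for;
# these are the bar heights of a scree plot.
percent = [100 * v / total for v in variation]

# Running total, to check how much the first two components explain together.
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

for i, (p, cum) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{i}: {p:5.1f}%  (cumulative {cum:5.1f}%)")
```

With these assumed values, the first two components together account for 75% of the variation, which is the kind of pattern the diagnostic is meant to reveal.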
