Our first post on Principal Component Analysis (PCA) has attracted a lot of traffic, and since it seems to be one of the more popular multivariate statistical methods for 2DGE data, we wanted to take the time to explain it in more detail, especially how we use it in the 2D gel image analysis software REDFIN.
The first two sections below contain a very brief introduction to PCA and its application to 2DGE data. More in-depth information can be found in textbooks on statistical methods, or by searching the Internet. If you are already familiar with the general ideas of PCA, you may want to skip ahead to the sections describing the PCA plot in REDFIN. There is also a tutorial video on how to use the PCA plot, available on the Ludesi website.
Introduction to PCA
PCA involves a mathematical procedure to find a smaller set of synthetic variables that capture the variance in an original data set. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
The results of a PCA are usually discussed in terms of component scores and loadings. The loadings tell how the principal components are related to the original variables, and the scores give the position of each data point (sample) along each principal component.
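To make scores, loadings and explained variance concrete, here is a minimal sketch of a plain (centred, unscaled) PCA using NumPy. It is an illustration only, not the code used in REDFIN; the function name pca and the random input data are made up for the example.

```python
import numpy as np

def pca(X):
    """Plain PCA of X (rows = samples, columns = variables) via SVD."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # position of each sample along each component
    loadings = Vt.T                         # column k: weights of the variables in component k
    explained = s**2 / np.sum(s**2)         # fraction of total variance per component
    return scores, loadings, explained

# Made-up data: 6 samples described by 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
scores, loadings, explained = pca(X)
print(explained)   # fractions in decreasing order, summing to 1
```

The scores are simply the centred data projected onto the loadings, so (X - X.mean(axis=0)) @ loadings reproduces them.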
PCA can be used as an exploratory tool to identify unknown trends in a multidimensional data set and to find samples or variables that tend to vary in the same way. If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a “shadow” of this object when viewed from its (in some sense) most informative viewpoint.
PCA has many other names. For instance, it is equivalent to the discrete Karhunen-Loève transform (KLT), the Hotelling transform, the proper orthogonal decomposition (POD), or the Singular Value Decomposition (SVD) of the covariance matrix of the data set.
REDFIN uses standardized (or auto-scaled) PCA, meaning that each variable is scaled to unit variance before the principal components are calculated. This is appropriate when the variables are not directly comparable, e.g. if they have different units.
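The effect of auto-scaling can be sketched as follows. This is a hedged illustration with made-up data, and the helper names autoscale and first_loading are our own; dividing by the sample standard deviation (ddof=1) is one common convention and is an assumption here, not a statement about REDFIN's internals.

```python
import numpy as np

def autoscale(X):
    """Centre each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def first_loading(X_centred):
    """Weights of the variables in the first principal component."""
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return Vt[0]

# Made-up data: three independent variables, one on a much larger scale.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X[:, 0] *= 1000.0

raw = first_loading(X - X.mean(axis=0))   # dominated by the large-scale variable
scaled = first_loading(autoscale(X))      # all variables compete on equal terms
print(np.round(raw, 3), np.round(scaled, 3))
```

Without scaling, the first component is almost entirely the variable with the largest numerical range, regardless of whether it is the most interesting one.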
PCA for the analysis of 2DGE data
For the purposes of PCA, each sample (gel image) constitutes one data point, described by the volumes of the protein spots found in the image. So each sample is described by several thousand variables, each with its own variance (gel-to-gel spot volume difference).
Finding the major principal components means finding the combinations of spot volumes that best describe the gel-to-gel differences. The loadings of the major principal components, i.e. the information about which spots/proteins contribute to each principal component, can indicate which proteins vary the most between samples.
The scores for each sample characterize that sample in terms of the principal components. Thus, samples with similar scores are in some sense similar. If the samples are shown in a 2- or 3-dimensional graph with the first two (or three) principal components on the axes, samples with similar behaviour will tend to “sit together”.
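As a rough sketch of this layout, the example below builds a synthetic data matrix with one row per gel and one column per spot, auto-scales it, and computes the scores. All numbers are invented: two groups of three gels, 30 spots, of which 20 are more abundant in the second group. Real experiments have far more spots and less clear-cut differences.

```python
import numpy as np

rng = np.random.default_rng(42)
n_spots = 30

# Rows are gels (samples), columns are spot volumes (variables); synthetic values.
group_a = rng.normal(loc=10.0, scale=0.5, size=(3, n_spots))
group_b = rng.normal(loc=10.0, scale=0.5, size=(3, n_spots))
group_b[:, :20] += 5.0                  # 20 spots are more abundant in group B

X = np.vstack([group_a, group_b])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # auto-scale each spot
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s                           # one row of scores per gel

pc1 = scores[:, 0]
print(np.round(pc1, 2))   # gels 0-2 fall on one side of zero, gels 3-5 on the other
```

Gels from the same group get similar scores on the first component, which is what makes them sit together in the plot. The sign of a component is arbitrary, so only the relative positions matter.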
The PCA plot in REDFIN
The PCA in REDFIN is performed using the samples in the currently selected comparison and the spots in the current list as input. This means that if not all gels are included in the groups in the current comparison, you will be studying a subset of your data. Applying filters to the list of proteins means that you select a subset of the possible variables (spot volumes) describing your samples. This will of course affect the results shown in the PCA plot and its interpretation. Sorting the list of proteins has no effect on the PCA. No direct information about the groups is used for the PCA.
The axes in the 3D-plot represent the three most important principal components. Each principal component is a linear combination of the volumes of the proteins, i.e. the score for a particular sample (its position along the principal component axis) is calculated by summing the spot volumes found in that gel image, weighted by constant (positive or negative) factors that are different for each protein ID.
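The weighted sum can be written out for a single gel and a single principal component. The protein IDs, weights and standardized volumes below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical weights (loadings) of four protein IDs in one principal component.
weights = {"P101": 0.5, "P102": -0.5, "P103": 0.5, "P104": 0.5}

# Hypothetical standardized spot volumes of the same proteins in one gel image.
volumes = {"P101": 1.5, "P102": 0.5, "P103": -1.0, "P104": 2.0}

# Score of this gel along the component: weighted sum of its spot volumes.
score = sum(weights[p] * volumes[p] for p in weights)
print(score)   # 1.0
```

A protein with a negative weight pulls the sample towards the negative end of the axis when its volume is above average, and towards the positive end when it is below average.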
By moving the mouse over each axis, you can see how much of the total variance this axis accounts for, and the IDs of the proteins with the largest weight in the principal component. Note that distances along different axes cannot be directly compared, as the axes represent uncorrelated variables, i.e. each is dominated by changes in the volumes of a different set of proteins.
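The kind of per-axis summary described above could be computed along these lines. This is a sketch under our own assumptions, not REDFIN's implementation: the function describe_axes, the spot IDs and the random data are all made up, and "largest weight" is taken to mean largest absolute loading.

```python
import numpy as np

def describe_axes(X, spot_ids, n_axes=3, top=3):
    """For each of the first n_axes components, return the fraction of
    variance explained and the IDs of the spots with the largest |weight|."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # auto-scale
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    summary = []
    for k in range(n_axes):
        order = np.argsort(-np.abs(Vt[k]))[:top]        # largest absolute weights first
        summary.append((float(explained[k]), [spot_ids[j] for j in order]))
    return summary

# Made-up data: 8 gels, 10 spots.
rng = np.random.default_rng(7)
X = rng.normal(size=(8, 10))
spot_ids = [f"spot_{j}" for j in range(10)]

summary = describe_axes(X, spot_ids)
for fraction, ids in summary:
    print(f"{fraction:.1%} of total variance, top spots: {ids}")
```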
The plot shows the average position of the samples in each group, with an option to show all the individual samples. Things to look for could be outliers, or samples that cluster in the graph, which means the samples are similar or vary in a correlated way.
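The average position of each group is just the mean of the scores of its samples. A minimal sketch, with invented scores and group names and a helper function group_means of our own:

```python
import numpy as np

# Invented scores of four gels on the first three principal components.
scores = np.array([[ 1.0, 0.0, 0.5],
                   [ 3.0, 2.0, 0.5],
                   [-2.0, 1.0, 0.0],
                   [-4.0, 3.0, 1.0]])
groups = ["control", "control", "treated", "treated"]

def group_means(scores, groups):
    """Average score vector of the samples in each group."""
    labels = np.array(groups)
    return {g: scores[labels == g].mean(axis=0) for g in dict.fromkeys(groups)}

means = group_means(scores, groups)
print(means["control"])   # [2.  1.  0.5]
print(means["treated"])   # [-3.   2.   0.5]
```

Note that the group labels are only used after the PCA to summarize the plot; they play no part in calculating the components themselves.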
Things we’ll cover in part 2 of this post
Part 2 takes a closer look at things to consider when you use PCA for 2DGE data. We will talk about things like standardization, a priori information, and missing values.
As always, feel free to comment with questions or feedback.