Principal Components Analysis: A Brief Visit
Centering and Scaling: An Example
The Importance of Exploratory Data Analysis in Multivariate Studies
Dimensionality Reduction via PCA
Principal Components Analysis
Like PLS, principal components analysis (PCA) attempts to use a relatively small number of components to model the information in a set of data that consists of many variables. Its goal is to describe the internal structure of the data by modeling its variance. It differs from PLS in that it does not interpret variables as inputs or outputs, but rather deals with a single matrix, usually denoted by X. Although the components that are extracted can be used in predictive models, in PCA there is no direct connection to a Y matrix.
Let’s look very briefly at an example. Open the data table Solubility.jmp by clicking on the correct link in the master journal. This JMP sample data table contains data on 72 chemical compounds that were measured for solubility in six different solvents, and is shown in part in Figure 3.1. The first column gives the name of the compound. The next six columns give the solubility measurements. We would like to develop a better understanding of the essential features of this data set, which consists of a 72 x 6 matrix.
Figure 3.1: Partial View of Solubility.jmp
PCA works by extracting linear combinations of the variables. First, it finds a linear combination of the variables that maximizes the variance. This is done subject to a constraint on the sizes of the coefficients, so that a solution exists. Subject to this constraint, the first linear combination explains as much of the variability in the data as possible. The observations are then weighted by this linear combination, to produce scores. The vector of scores is called the first principal component. The vector of coefficients for the linear combination is sometimes called the first loading vector.
Next, PCA finds the linear combination that, among all linear combinations orthogonal to the first, has the highest variance. (Again, a constraint is placed on the sizes of the coefficients.) This second loading vector is used to compute scores for the observations, resulting in the second principal component. This second principal component explains as much variance as possible in a direction orthogonal to that of the first loading vector. Subsequent linear combinations are extracted similarly, each explaining the maximum variance in the space orthogonal to the loading vectors that have been previously extracted.
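To make this description concrete, here is a minimal sketch of the computation in Python (an illustration only, not JMP's implementation; the simulated matrix X, the use of numpy, and the eigendecomposition route are assumptions made just for the example):

```python
import numpy as np

# Minimal sketch of PCA as described above (not JMP's own code).
# X stands in for any n-by-p numeric matrix with observations in rows;
# here we simulate one with the same shape as the 72 x 6 solubility matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(72, 6))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # center and scale each column
R = (Z.T @ Z) / (Z.shape[0] - 1)                   # correlation matrix of the columns

eigvals, eigvecs = np.linalg.eigh(R)               # eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]                  # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs                                 # column j is the jth loading vector
scores = Z @ loadings                              # column j holds the jth component's scores
pct_var = 100 * eigvals / eigvals.sum()            # percent of variation per component
print(np.cumsum(pct_var))                          # cumulative percent of variation explained
```

The first column of scores is the first principal component, and the cumulative percentages printed at the end are the quantities that, for the solubility data, appear in the Eigenvalues report described below.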
To perform PCA for this data set in JMP:
1. Select Analyze > Multivariate Methods > Principal Components.
2. Select the columns 1-Octanol through Hexane and add them as Y, Columns.
3. Click OK.
4. In the red triangle menu for the resulting report, select Eigenvalues.
Your report should appear as in Figure 3.2. (Alternatively, you can simply run the last script in the data table panel, Principal Components.)
Figure 3.2: PCA Analysis for Solubility.jmp
Each row of data is transformed to a score on each principal component. Plots of these scores for the first two principal components are shown. We won’t get into the technical details, but each component has an associated eigenvalue and eigenvector. The Eigenvalues report indicates that the first component accounts for 79.75% of the variation in the data, and that the second component brings the cumulative total variation accounted for to 95.50%.
The plot on the far right in Figure 3.2, called a loading plot, gives insight into the data structure. All six of the variables have positive loadings on the first component. This means that the largest component of the variability is explained by a linear combination of all six variables with positive coefficients for each variable. But the second component has positive loadings only for 1-Octanol and Ether, while all other variables have negative loadings. This indicates that the next largest source of variability is a contrast between a compound's solubility in 1-Octanol and Ether and its solubility in the other four solvents.
For more information about the PCA platform in JMP, select Analyze > Multivariate Methods > Principal Components. Then, in the launch window, click Help.
Centering and Scaling: An Example
As mentioned in Chapter 1, multivariate methods, such as PCA, are very sensitive to the scale of the data. Open the data table LoWarp.jmp by clicking on the correct link in the master journal. These data, presented in Eriksson et al. (2006), are from an experiment run to minimize the warp and maximize the strength of a polymer used in a mobile phone. In the column group called Y, you see eight measurements of warp (warp1 – warp8) and six measurements of strength (strength1 – strength6).
Run the first script in the table, Raw Y. You obtain a plot of comparative box plots, as shown in Figure 3.3. Most of the box plots are dwarfed by the four larger ones.
Figure 3.3: Comparative Box Plots for Raw Data
The plot shows that the variables strength2 and strength4 dominate in terms of the raw measurement scale – their values are much larger than those of the other variables. If these raw values were used in a multivariate analysis such as PCA or PLS, these two variables would dominate.
We can lessen the impact of the size of their values by subtracting each variable’s mean from all its measurements. As mentioned in Chapter 1, this is called centering. The Columns panel in the data table contains a column group for the centered variables, called Y Centered. Each variable in this group is given by a formula. To see the formulas, click on the disclosure icon to the left of the group name to display the variables in the group. Next, click on any of the + signs to the right of the variable names. You see that the calculated values in any given column consist of the raw data minus the appropriate column mean.
Run the script Centered Y to obtain the box plots for the centered data shown in Figure 3.4. Although the data are now centered at 0, the variables strength2_Centered and strength4_Centered still dominate because of their relatively high variability.
Figure 3.4: Comparative Box Plots for Centered Data
Let’s not only center the data in any given column, but let’s also divide these centered values by the standard deviation of the column to scale the data. JMP has a function that both centers and scales a column of data. The function is called Standardize. The column group Y Centered and Scaled contains standardized versions of each of the variables. You can check this by looking at the formulas that define the columns. Run the script Centered and Scaled Y to obtain the comparative box plots of the standardized variables shown in Figure 3.5.
Figure 3.5: Comparative Box Plots for Centered and Scaled Data
As mentioned in Chapter 1, we see that the act of centering and scaling (or standardizing) the variables does indeed place all of them on an equal footing. Although there can be exceptions, it is generally the case that, in PCA and PLS, centering and scaling your variables is desirable.
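For readers who want to see the arithmetic, here is a small Python sketch of centering and standardizing a single column (illustrative only; the values in y are made up, and this is simply the subtract-the-mean, divide-by-the-standard-deviation recipe described above, not JMP's Standardize function itself):

```python
import numpy as np

y = np.array([3.1, 4.7, 2.9, 5.5, 4.0])     # any raw column of measurements (made-up values)

centered = y - y.mean()                      # centering: subtract the column mean
standardized = centered / y.std(ddof=1)      # scaling: divide by the sample standard deviation

print(round(centered.mean(), 10))            # 0.0: centered data have mean zero
print(round(standardized.std(ddof=1), 10))   # 1.0: standardized data have unit standard deviation
```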
In PCA, the JMP default is to calculate Principal Components > on Correlations, as shown in Figure 3.6. This means that the variables are first centered and scaled, so that the matrix containing their inner products is the correlation matrix. JMP also enables the user to select Principal Components > on Covariances, which means that the data are simply centered, or Principal Components > on Unscaled, which means that the raw data are used.
Figure 3.6: PCA Default Calculation
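The three options differ only in how the columns are pretreated before the decomposition. The following Python sketch mirrors the descriptions above (it is an illustration under those assumptions, not JMP's internal code):

```python
import numpy as np

def pca_eigenvalues(X, option="correlations"):
    """Eigenvalues of the matrix PCA decomposes under each pretreatment option."""
    if option == "correlations":                      # center and scale: correlation matrix
        M = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    elif option == "covariances":                     # center only: covariance matrix
        M = X - X.mean(axis=0)
    else:                                             # "unscaled": use the raw data
        M = X
    cross = (M.T @ M) / (X.shape[0] - 1)
    return np.linalg.eigvalsh(cross)[::-1]            # largest first

# Columns on wildly different scales make the choice matter.
X = np.random.default_rng(2).normal(size=(20, 3)) * np.array([1.0, 10.0, 100.0])
for opt in ("correlations", "covariances", "unscaled"):
    print(opt, pca_eigenvalues(X, opt)[0])            # leading eigenvalue under each option
```

With raw or merely centered data, the column with the largest measurement scale dominates the leading eigenvalue; on correlations, all three columns contribute on an equal footing.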
The Importance of Exploratory Data Analysis in Multivariate Studies
Visual data exploration should be a first step in any multivariate study. In the next section, we use some simulated data to see how PCA reduces dimensionality. But first, let’s explore the data that we use for that demonstration.
Run the script DimensionalityReduction.jsl by clicking on the correct link in the master journal. This script generates three panels. Each panel gives a plot of 11 quasi-random values for two variables. The first panel that appears when you run the script shows the raw data for X1 and X2, which we refer to as the Measured Data Values (Figure 3.7). Your plot will be slightly different, because the points are random, and the Summary of Measured Data information will differ to reflect this.
Figure 3.7: Panel 1, Measured Data Values
In Panel 1, the Summary of Measured Data gives the mean of each variable and the Variance-Covariance Matrix for X1 and X2. Note that the variance-covariance matrix is symmetric. The diagonal entries are the variance of X1 (upper left) and the variance of X2 (lower right), while the off-diagonal entries give the covariance of X1 and X2.
Covariance measures the joint variation in X1 and X2. Because the covariance value depends on the units of measurement of X1 and X2, its absolute size is not all that meaningful. However, the pattern of joint variation between X1 and X2 can be discerned and assessed from the scatterplot in Figure 3.7. Note that, as X1 increases, X2 tends to increase as well, but there appears to be one point in the lower right corner of the plot that doesn’t fit this general pattern. Although the points generated by the script are random, you should see an outlier in the lower right corner of your plot as well.
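The variance-covariance matrix shown in Panel 1 can be reproduced for any pair of columns in a few lines of Python (a sketch; x1 and x2 below are simulated stand-ins, not the script's data):

```python
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(size=11)                   # stand-in for the 11 simulated X1 values
x2 = 0.8 * x1 + rng.normal(size=11)        # X2 constructed to covary with X1

V = np.cov(x1, x2)                         # 2 x 2 variance-covariance matrix
print(V[0, 0], V[1, 1])                    # variances of X1 (upper left) and X2 (lower right)
print(V[0, 1], V[1, 0])                    # covariance of X1 and X2 (equal off-diagonal entries)
```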
Panel 2, shown in Figure 3.8, displays the Centered and Scaled Data Values. For the centered and scaled data, the covariance matrix is just the correlation matrix. Here, the off-diagonal entries give the correlation between X1 and X2. This correlation value does have an interpretation based on its size and its sign. In our example, the correlation is 0.221, indicating a weak positive relationship.
Figure 3.8: Panel 2, Centered and Scaled Data Values
However, as you might suspect, the outlying point might be causing the correlation coefficient to be smaller than expected. In the top panel, you can use your mouse to drag that outlying point, which enables you to investigate its effect on the various numerical summaries, and in particular, on the correlation shown in the second panel. The effect of moving the rogue point into the cloud of the remaining X1, X2 values is shown for our data in Figure 3.9. (The point is now the one at the top right.) The correlation increases from 0.221 to 0.938.
Figure 3.9: Effect of Dragging Outlier to Cloud of Points
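The drag experiment can also be mimicked numerically. The Python fragment below (with made-up values, not the data in Figure 3.9) shows how a single rogue point can pull the correlation well below what the rest of the points suggest:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=10)
x2 = x1 + 0.3 * rng.normal(size=10)            # ten points with a strong positive pattern

# One rogue point in the "lower right": a large X1 value paired with a small X2 value.
x1_out = np.append(x1, 3.0)
x2_out = np.append(x2, -3.0)

print(np.corrcoef(x1_out, x2_out)[0, 1])       # correlation with the outlier: much weaker
print(np.corrcoef(x1, x2)[0, 1])               # correlation once the rogue point rejoins the cloud
```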
We move on to Panel 3 in the next section, but first a few remarks. Remember that PCA is, by default, based on the correlation matrix. We see later that the same is true of PLS. This large change in the correlation highlights the importance of checking your data for incongruous samples before conducting any analysis. In fact, the importance of exploratory data analysis (EDA) increases as the number of variables increases. JMP facilitates this task through its interactivity and dynamically linked views.
In Chapter 2, “A Review of Multiple Linear Regression,” we saw that fitting a regression model produces residuals. Residuals can also form a basis for detecting incongruous values in multivariate procedures such as PCA and PLS. However, one must remember that residuals only judge samples in relation to the model currently being considered. It is always best to precede model building with exploratory analysis of one’s data.
Dimensionality Reduction via PCA
In the context of the data table Solubility.jmp, we saw that the first two principal components explain about 95% of the variation in the six variables. Let’s continue using the script DimensionalityReduction.jsl to gain some intuition for exactly how PCA reduces dimensionality.
With the slider in Panel 2 set at 0 Degrees, we see the axes shown in Figure 3.10. The vertical blue lines in Panel 3 show the distances of the points from the horizontal axis. The sum of the squares of these distances, called the Sum of Squared Residuals, is given to the right of the plot. This sum equals 10 for your simulated data as well as for ours. This is because the sum is computed for the centered and scaled data: each standardized variable has sample variance 1, so the sum of its 11 squared deviations from zero is 11 – 1 = 10.
Figure 3.10: No Rotation of Axes
But now, move your slider in Panel 2 to the right to rotate the axes until you come close to minimizing the Sum of Squared Residuals given in Panel 3 (Figure 3.11). The total length of the blue lines in the third panel is greatly reduced. In effect, the third panel gives a view of the cloud of data from a rotated coordinate system (defined by the red axes in the second panel).
Figure 3.11: Rotation of Axes
From this new point of view, we have explained much of the variation in the data using a single coordinate. We can think of each point as being projected onto the horizontal line in Panel 3, or, equivalently, onto the rotated axis pointing up and to the right in the second panel. In fact, PCA proceeds in just this manner, identifying the direction of the first principal component as the axis along which the variation of the projected points is maximized.
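The same geometry can be expressed in a few lines of code: standardize the two columns, project the points onto an axis at a given angle, and track the sum of squared residuals as the axis rotates. The angle that minimizes this sum is the direction of the first principal component. Here is a Python sketch (with simulated data, not the DimensionalityReduction.jsl script itself):

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=11)
x2 = 0.7 * x1 + 0.5 * rng.normal(size=11)          # 11 simulated points, as in the demonstration

Z = np.column_stack([x1, x2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # center and scale both columns

def ssr(theta):
    """Sum of squared residuals after projecting onto an axis at angle theta (radians)."""
    axis = np.array([np.cos(theta), np.sin(theta)])
    along = Z @ axis                               # coordinate of each point along the axis
    residuals = Z - np.outer(along, axis)          # the part of each point the axis misses
    return float(np.sum(residuals ** 2))

print(ssr(0.0))                                    # 10, that is 11 - 1, for standardized data
angles = np.linspace(0.0, np.pi, 181)
best = angles[np.argmin([ssr(t) for t in angles])]
print(np.degrees(best))                            # angle of the first principal component axis
```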
You can close the report generated by the script DimensionalityReduction.jsl.