
Contents

4  A Deeper Understanding of PLS

Centering and Scaling in PLS

PLS as a Multivariate Technique

Why Use PLS?

How Does PLS Work?

PLS versus PCA

PLS Scores and Loadings

Some Technical Background

An Example Exploring Prediction

One-Factor NIPALS Model

Two-Factor NIPALS Model

Variable Selection

SIMPLS Fits

Choosing the Number of Factors

Cross Validation

Types of Cross Validation

A Simulation of K-Fold Cross Validation

Validation in the PLS Platform

The NIPALS and SIMPLS Algorithms

Useful Things to Remember About PLS

Centering and Scaling in PLS

Although it can be adapted to more general situations, PLS usually involves only two sets of variables, one interpreted as predictors, X, and one as responses, Y. As with PCA, it is usually best to apply PLS to data that have been centered and scaled. As shown in Chapter 3, this puts all variables on an equal footing. This is why the Centering and Scaling options are turned on by default in the JMP PLS launch window.

There are sometimes cases where it might be useful or necessary to scale blocks of variables in X and/or Y differently. This can easily be done using JMP column formulas (as we saw in LoWarp.jmp) or using JMP scripting (Help > Books > Scripting Guide). In cases where you define your own scaling, be sure to deselect the relevant options in the PLS launch window. For simplicity, we assume for now that we always want all variables to be centered and scaled.
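
To make the convention concrete, here is a minimal sketch (in generic NumPy code, not one of the book's JSL scripts) of centering and scaling the columns of a small, made-up data matrix; JMP performs this same step automatically when the launch options are left selected.

import numpy as np

# Hypothetical illustration: center and scale each column to mean 0 and
# standard deviation 1, the convention assumed throughout this chapter.
X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 12.0]])
X_cs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # ddof=1: sample standard deviation
print(X_cs.mean(axis=0))           # each column mean is (numerically) 0
print(X_cs.std(axis=0, ddof=1))    # each column standard deviation is 1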

PLS as a Multivariate Technique

When all variables are centered and scaled, their covariance matrix equals their correlation matrix. The correlation matrix becomes the natural vehicle for representing the relationship between the variables. We have already talked about correlation, specifically in the context of predictors and the X matrix in MLR.

But the distinction between predictors and responses is contextual. Given any data matrix, we can compute the sample correlation between each pair of columns regardless of the interpretation we choose to assign to the columns. The sample correlations form a square matrix with ones on the main diagonal. Most linear multivariate methods, PLS included, start from a consideration of the correlation matrix (Tobias 1995).

Suppose, then, that we have a total of v variables measured on our n units or samples. We consider k of these to be responses and m of these to be predictors, so that v = k + m. The correlation matrix, denoted Σ, is v x v. Suppose that k = 2 and m = 4. Then we can represent the correlation matrix schematically as shown in Figure 4.1, where we order the variables in such a way that responses come first.

Figure 4.1: Schematic of Correlation Matrix Σ, for Ys and Xs


Recall that the elements of Σ must be between –1 and +1 and that the matrix is square and symmetric. But not every matrix with these properties is a correlation matrix. Because of the very meaning of correlation, the elements of a correlation matrix cannot be completely independent of one another. For example, in a 3 x 3 correlation matrix there are three off-diagonal elements: If one of these elements is 0.90 and another is –0.80, it is possible to show that the third must be between –0.98 and –0.46.
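
The bound cited here follows from the requirement that a correlation matrix be positive semidefinite. For a 3 x 3 correlation matrix with off-diagonal entries r12, r13, and r23, the feasible range of the third correlation, given the other two, is

r_{12}r_{13} - \sqrt{(1 - r_{12}^2)(1 - r_{13}^2)} \;\le\; r_{23} \;\le\; r_{12}r_{13} + \sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}

so with r12 = 0.90 and r13 = –0.80 the limits are –0.72 ± 0.26, that is, approximately –0.98 and –0.46.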

As we soon see in a simulation, PLS builds models linking predictors and responses. PLS does this using projection to reduce dimensionality by extracting factors (also called latent variables in the PLS context). The information it uses to do this is contained in the dark green elements of Σ, in this case a 2 x 4 submatrix. For general values of k and m, this sub-matrix is not square and does not have any special symmetry. We describe exactly how the factors are constructed in Appendix 1.

For now, though, note that the submatrix used by PLS contains the correlations between the predictors and the responses. Using these correlations, the factors are extracted in such a way that they not only explain variation in the X and Y variables, but they also relate the X variables to the Y variables.

As you might suspect, consideration of the entire 6 x 6 correlation matrix without regard to predictors and responses leads directly to PCA. As we have seen in Chapter 3, PCA also exploits the idea of projections to reduce dimensionality, and is often used as an exploratory technique prior to PLS.

Consideration of the 4 x 4 submatrix (the orange elements) leads to a technique called Principal Components Regression, or PCR (Hastie et al. 2001). Here, the dimensionality of the X space is reduced through PCA, and the resulting components are treated as new predictors for each response in Y using MLR. To fit PCR in JMP requires a two-stage process (Analyze > Multivariate Methods > Principal Components, followed by Analyze > Fit Model). In many instances, though, PLS is a superior choice.
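
Outside JMP, the two-stage logic of PCR can be sketched in a few lines of generic code; the data, the choice of three components, and the variable names below are purely illustrative.

import numpy as np

# Illustrative PCR sketch: (1) reduce the centered X matrix with PCA,
# (2) regress the response on the leading principal component scores.
rng = np.random.default_rng(1)
X = rng.normal(size=(45, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=45)

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
a = 3                                   # number of principal components retained
T = Xc @ Vt[:a].T                       # principal component scores
b, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
y_hat = y.mean() + T @ b                # PCR predictions of the response

Notice that the components are chosen using only the variation in X; unlike PLS, the response plays no role in defining them.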

For completeness, we mention that consideration of the 2 x 2 submatrix associated with the Ys (the blue elements) along with the 4 x 4 submatrix associated with the Xs (the orange elements) leads to Maximum Redundancy Analysis, MRA (van den Wollenberg 1977). This is a technique that is not as widely used as PLS, PCA, and PCR.

Why Use PLS?

Consistent with the heritage of PLS, let’s consider a simulation of a simplified example from spectroscopy. In this situation, samples are measured in two ways: Typically, one is quick, inexpensive, and online; the other is slow, expensive, and offline, usually involving a skilled technician and some chemistry. The goal is to build a model that predicts well enough so that only the inexpensive method need be used on subsequent samples, acting as a surrogate for the expensive method.

The online measurement consists of a set of intensities measured at multiple wavelengths or frequencies. These measured intensities serve as the values of X for the sample at hand. To simplify the discussion, we assume that the technician only measures a single quantity, so that (as in our MLR example in Chapter 2) Y is a column vector with the same number of rows as we have samples.

To set up the simulation, run the script SpectralData.jsl by clicking on the correct link in the master journal. This opens a control panel, shown in Figure 4.2.

Figure 4.2: Control Panel for Spectral Data Simulation


Once you obtain the control panel, complete the following steps:

1. Set the Number of Peaks to 3 Peaks.

2. Set the Noise Level in Intensity Measurements slider to 0.02.

3. Leave the Noise in Model slider set to 0.00.

4. Click Run.

This produces a data table with 45 rows containing an ID column, a Response column, and 81 columns representing wavelengths, which are collected in the column group called Predictors. The data table also has a number of saved scripts.

Let’s run the first script, Stack Wavelengths. This script stacks the intensity values so that we can plot the individual spectra. In the data table that the script creates, run the script Individual Spectra. Figure 4.3 shows plots similar to those that you see.

Figure 4.3: Individual Spectra


Note that some samples display two peaks and some three. In fact, the very definition of what is or is not a peak can quickly be called into question with real data, and over the years spectroscopists and chemometricians have developed a plethora of techniques to pre-process spectral data in ways that are reflective of the specific technique and instrument used.

Now run the script Combined Spectra in the stacked data table. This script plots the spectra for all 45 samples against a single set of axes (Figure 4.4). You can click on an individual set of spectral readings in the plot to highlight its trace and the corresponding rows in the data table.

Figure 4.4: Combined Spectra


Our simulation captures the essence of the analysis challenge. We have 81 predictors and 45 rows. A common strategy in such situations is to attempt to extract significant features (such as peak heights, widths, and shapes) and to use this smaller set of features for subsequent modeling. However, in this case we have neither the desire nor the background knowledge to attempt this. Rather, we take the point of view that the intensities in the measured spectrum (the row within X), taken as a whole, provide a fingerprint for that row that we try to relate to the corresponding measured value in Y.

Let’s close the data table Stacked Data and return to the main data table Response and Intensities. Run the script Correlations Between Xs. This script creates a color map that shows the correlations between every pair of predictors using a blue to red color scheme (Figure 4.5). Note that the color scheme is given by the legend to the right of the plot. To see the numerical values of the correlations, click the red triangle next to Multivariate and select Correlations Multivariate.

Figure 4.5: Correlations for Predictors Shown in a Color Map


In the section “The Effect of Correlation among Predictors: A Simulation” in Chapter 2, we investigated the impact of varying the correlation between just two predictors. Here we have 81 predictors, one for each wavelength, resulting in 81*80/2 = 3,240 pairs of predictors. Figure 4.5 gives a pictorial representation of the correlations among all 3,240 pairs.

The cells on the main diagonal are colored the most intense shade of red, because the correlation of a variable with itself is +1. Beyond the diagonal, Figure 4.5 shows three large blocks of red. These blocks are a consequence of the three peaks that you requested in the simulation. You can experiment by rerunning the simulation with a different number of peaks and other slider settings to see the impact on this correlation structure.

Next, in the data table, find and run the script MLR (Fit Model). This attempts to fit a multiple linear regression to Response, using all 81 columns as predictors. The report starts out with a long list of Singularity Details. This report, for our simulated data, is partially shown in Figure 4.6.

Figure 4.6: Partial List of Singularity Details for Multiple Linear Regression Analysis


Here, the X matrix has 81+1 = 82 columns, but X and Y have only 45 rows. Because n < m (using our earlier notation), we should expect MLR to run into trouble. Note that the JMP Fit Model platform does produce some output, though it’s not particularly useful in this case. If you want more details about what JMP is doing here, select Help > Books > Fitting Linear Models and search for “Singularity Details”.
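
The root of the problem is easy to verify in generic terms: with more columns than rows, the cross-product matrix X'X cannot have full rank, so the MLR normal equations have no unique solution. A minimal sketch with simulated numbers:

import numpy as np

# With n = 45 rows and m = 81 predictors (plus an intercept), X'X is singular.
rng = np.random.default_rng(2)
n, m = 45, 81
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # intercept + predictors
XtX = X.T @ X                                               # 82 x 82 cross-product matrix
print(np.linalg.matrix_rank(XtX))                           # at most 45, far less than 82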

Now run the script Partial Least Squares to see a partial least squares report. We cover the report details later on, but for now, notice that there is no mention of singularities. In fact, the Variable Importance Plot (Figure 4.7) assesses the contribution of each of the 81 wavelengths in modeling the response. Because higher Variable Importance for the Projection (VIP) values suggest higher influence, we conclude that wavelengths between about –3.0 and 1.0 have comparatively higher influence than the rest.

Figure 4.7: PLS Variable Importance Plot


As mentioned earlier, our example is deliberately simplified. It is not uncommon for spectra to be measured at a thousand wavelengths, rather than 81. One challenge for software is to find useful representations, especially graphical representations, to help tame this complexity. Here we have seen that for this type of data, PLS holds the promise of providing results, whereas MLR clearly fails.

You might like to rerun the simulation with different settings to see how these plots and other results change. Once you are finished, you can close the reports produced by the script SpectralData.jsl.

How Does PLS Work?

So what, at a high level at least, is going on behind the scenes in a PLS analysis? We use the script PLSGeometry.jsl to illustrate. This script generates an invisible data table consisting of 20 rows of data with three Xs and three Ys. It then models these data using either one or two factors. Run the script by clicking on the correct link in the master journal. In the control panel window that appears (Figure 4.8), leave the Number of Factors set at One and click OK.

Figure 4.8: Control Panel for PLSGeometryDemo.jsl


This results in a two-by-two arrangement of 3-D scatterplots, shown in Figure 4.9. A Data Filter window also opens, shown in Figure 4.10.

Because the data table behind these plots contains three responses and n = 20 observations, the response matrix Y is 20 x 3. This is the first time that we have encountered a matrix Y of responses, rather than simply a column vector of responses. To use MLR in this case, we would have to fit a model to each column in Y separately. So, any information about how the three responses vary jointly would be lost. Although in certain cases it is desirable to model each response separately, PLS gives us the flexibility to leverage information relative to the joint variation of multiple responses. It makes it easy to model large numbers of responses simultaneously in a single model.

Figure 4.9: 3-D Scatterplots for One Factor


The two 3-D scatterplots on the left enable us to see the actual values for all six variables for all observations simultaneously, with the predictors in the top plot and the responses in the bottom plot. By rotating the plot in the upper left, you can see that the 20 points do not fill the whole cube. Instead, they cluster together, indicating that the three predictors are quite strongly correlated.

You can see how the measured response values relate to the measured predictor values by clicking the Go arrow in the video-like Animation Controls of the Data Filter window that the script produced (Figure 4.10). When you do this, the highlighting loops over the observations, showing the corresponding points in the other plots. To pause the animation, click the button with two vertical bars that replaces the Go arrow.

Figure 4.10: Data Filter for Demonstration


The two plots on the right give us some insight into how PLS works. These plots display predicted, rather than actual, values. By rotating both of these plots, you can easily see that the predicted X values and the predicted Y values are each perfectly collinear. In other words, for both the Xs and the Ys, the partial least squares algorithm has projected the three-dimensional cloud of points onto a line.

Now, once again, run through the points using the Animation Controls in the Data Filter window and observe the two plots on the right. Note that, as one moves progressively along the line in Predicted X Values, one also moves progressively along the line in Predicted Y Values. This indicates that PLS not only projects the point clouds of the Xs and the Ys onto a lower-dimensional subspace, but does so in a way that reflects the correlation structure between the Xs and the Ys. Given a new observation's three X coordinates, PLS would project them to obtain predicted X values and would use the related information to compute the corresponding predicted Y values for that observation.

In Appendix 1, we give the algorithm used in computing these results. For now, simply note that we have illustrated the statement made earlier in this chapter that PLS is a projection method that reduces dimensionality. In our example, we have taken a three-dimensional cloud of points and represented those points using a one-dimensional subspace, namely a line. As the launch window indicates, we say that we have extracted one factor from the data. In our example, we have used this one factor to define a linear subspace onto which to project both the Xs and the Ys.
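
For readers who prefer symbols to pictures, the following rough sketch (simulated data, not the demonstration's own table) extracts a single factor and shows why the predicted values line up; taking the first weight vector to be the dominant left singular vector of X'Y matches what the NIPALS iteration described later converges to.

import numpy as np

# Simulated stand-in for the demonstration: 20 observations, three correlated
# Xs and three correlated Ys driven by a single latent variable.
rng = np.random.default_rng(3)
latent = rng.normal(size=(20, 1))
X = latent @ np.array([[1.0, 0.8, 0.6]]) + 0.1 * rng.normal(size=(20, 3))
Y = latent @ np.array([[0.9, 0.7, 0.5]]) + 0.1 * rng.normal(size=(20, 3))
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)

w1 = np.linalg.svd(Xc.T @ Yc)[0][:, 0]   # first X weight vector
t1 = Xc @ w1                             # X scores on the single factor
p1 = Xc.T @ t1 / (t1 @ t1)               # X loadings
q1 = Yc.T @ t1 / (t1 @ t1)               # regression of each Y on the scores
X_hat = np.outer(t1, p1)                 # rank 1: predicted X values lie on a line
Y_hat = np.outer(t1, q1)                 # rank 1: predicted Y values lie on a line

Because both predicted matrices have rank one, their rows plot as points along straight lines, just as in the two plots on the right of Figure 4.9.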

Because it works by extracting latent variables, partial least squares is also called projection to latent structures. While the term “partial least squares” stresses the relationship of PLS to other regression methods, the term “projection to latent structures” emphasizes a more fundamental empirical principle: Namely, the underlying structure of highly dimensional data associated with complex phenomena is often largely determined by a smaller number of factors or latent variables that are not directly accessible to observation or measurement (Tabachnick and Fidell 2001). It is this aspect of projection, which is fundamental to PLS, that the image on the front cover is intended to portray.

Note that, if the observations do not have some correlation structure, attempts at reducing their dimensionality are not likely to be fruitful. However, as we have seen in the example from spectroscopy, there are cases where the predictors are necessarily correlated. So PLS actually exploits the situation that poses difficulties for MLR.

Given this as background, close your data filter and 3-D scatterplot report. Then rerun PLSGeometry.jsl, this time choosing Two as the Number of Factors. The underlying data structure is the same as before. Rotate the plots on the right and observe that the predicted values for both the Xs and the Ys fall on a plane, a two-dimensional subspace defined by the two factors. Again, loop through these points using the Animation Controls in the Data Filter window. As you might expect, the two-dimensional representation provides a better description of the original data than does the one-factor model.

When data are highly multidimensional, a critical decision involves how many factors to extract to provide a sound representation of the original data. This comes back to finding a balance between underfitting and overfitting. We address this later in our examples. For now, close the script PLSGeometry.jsl and its associated reports.

PLS versus PCA

As described earlier, PCA uses the correlation matrix for all variables of interest, whereas PLS uses the submatrix that links responses and predictors. In a situation where there are both Ys and Xs, Figure 4.1 indicates that PCA uses the orange-colored correlations, whereas PLS uses the green-colored correlations. These green entries are the correlations that link the responses and predictors. PLS attempts to identify factors that simultaneously reduce dimensionality and provide predictive power.

To see a geometric representation that contrasts PLS and PCA, run the script PLS_PCA.jsl by clicking on the correct link in the master journal. This script simulates values for two predictors, X1 and X2, and a single response Y. A report generated by this script is shown in Figure 4.11.

Figure 4.11: Plots Contrasting PCA and PLS


The Contour Plot for Y on the left shows how the true value of Y changes with X1 and X2. The continuous color intensity scale shows large values of Y in red and small values in blue, as indicated by the legend to the right of the plot. The contour plot indicates that the response surface is a plane tilted so that it slopes upward in the upper left of the X1, X2 plane and downward in the lower right of the X1, X2 plane. Specifically, the relationship is given by Y = –X1 + .75X2.

The next two plots, Principal Components and PLS Weights, are obtained using simulated values for X1 and X2. But Y is computed directly using the relationship shown in the Contour Plot.

The Principal Components plot shows the two principal components. The direction of the first component, PC1, captures as much variation as possible in the values of X1 and X2 regardless of the value of Y. In fact, PC1 is essentially perpendicular to the direction of increase in Y, as shown in the contour plot. PC1 ignores any variation in Y. The second component, PC2, captures residual variation, again ignoring variation in Y.

The PLS Weights plot shows the directions of the two PLS factors, or latent variables. Note that PLS1 is rotated relative to PC1. PLS1 attempts to explain variation in X1 and X2 while also explaining some of the variation in Y. You can see that, while PC1 is oriented in a direction that gives no information about Y, PLS1 is rotated slightly toward the direction of increase (or decrease) for Y.
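
A rough numerical analogue (with a simulated covariance structure, not the script's exact settings) makes the contrast between the two directions concrete:

import numpy as np

# Simulated analogue of the demonstration: correlated X1, X2 and the response
# surface Y = -X1 + 0.75*X2 from the contour plot.
rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
y = -X[:, 0] + 0.75 * X[:, 1]

Xc = X - X.mean(axis=0)
yc = y - y.mean()
pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]  # first principal component direction
w1 = Xc.T @ yc
w1 /= np.linalg.norm(w1)                            # first PLS weight direction
print(pc1)   # defined by the X1, X2 scatter alone, ignoring y
print(w1)    # tilted toward the direction in which y changes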

This simulation illustrates the fact that PLS tries to balance the requirements of dimensionality reduction in the X space with the need to explain variation in the response. You can close the report produced by the script PLS_PCA.jsl now.

PLS Scores and Loadings

Some Technical Background

Extracting Factors

Before considering some examples that illustrate more of the basic PLS concepts, let’s introduce some of the technical background that underpins PLS. As you know by now, a main goal of PLS is to predict one or more responses from a collection of predictors. This is done by extracting linear combinations of the predictors that are variously referred to as latent variables, components, or factors. We use the term factor exclusively from now on to be consistent with JMP usage.

We assume that all variables are at least centered. Also keep in mind that there are various versions of PLS algorithms. We mentioned earlier that JMP provides two approaches: NIPALS and SIMPLS. The following discussion describes PLS in general terms, but to be completely precise, one needs to refer to the specific algorithm in use.

With this caveat, let's consider the calculations associated with the first PLS factor. Suppose that X is an n x m matrix whose columns are the m predictors and that Y is an n x k matrix whose columns are the k responses. The first PLS factor is defined by an m x 1 weight vector, w1, whose elements reflect the covariance between the predictors in X and the responses in Y. The jth entry of w1 is the weight associated with the jth predictor. Subject to norm restrictions, w1 defines the linear combination of the variables in X that has maximum covariance with a linear combination of the variables in Y.

The weight vector w1 is applied to the columns of X: for each observation, the m predictor values are combined using the weights in w1. The resulting n weighted linear combinations are called X scores, denoted by the vector t1. In other words, the X scores are the entries of the vector t1 = Xw1. Note that the score vector, t1, is n x 1; each observation is given an X score on the first factor. Think of the vector w1 as defining a linear transformation mapping the m predictors to a one-dimensional subspace. With this interpretation, Xw1 represents the mapping of the data to this one-dimensional subspace.

Technically, t1 is a linear combination of the variables in X that has maximum covariance with a linear combination of the variables in Y, subject to normalizing constraints. That is, there is a vector c1 with the property that the covariance between t1 = Xw1 and u1 = Yc1 is a maximum. The vector c1 is a Y weight vector, also called a loading vector. The elements of the vector u1 are the Y scores. So, for the first factor, we would expect the X scores and the Y scores to be strongly correlated.

To obtain subsequent factors, we use all factors available to that point to predict both X and Y. In the NIPALS algorithm, the process of obtaining a new weight vector and defining new X scores is applied to the residuals from the predictive models for X and Y. (We say that X and Y are deflated and the process itself is called deflation.) This ensures that subsequent factors are independent of (orthogonal to) all previously extracted factors. In the SIMPLS algorithm, the deflation process is applied to the cross-product matrix. (For complete information, see Appendix 1.)
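
As an illustration only, the following compact sketch implements a NIPALS-style fit for the simplest case of a single response (often called PLS1), which avoids the inner iteration needed when Y has several columns; the function name and data are hypothetical, and JMP's implementations are described in Appendix 1.

import numpy as np

# Illustrative single-response NIPALS sketch (PLS1); not JMP's implementation.
def nipals_pls1(X, y, a):
    X, y = X.copy(), y.copy()
    n, m = X.shape
    W = np.zeros((m, a))
    P = np.zeros((m, a))
    T = np.zeros((n, a))
    q = np.zeros(a)
    for i in range(a):
        w = X.T @ y
        w /= np.linalg.norm(w)              # X weights, normalized to length one
        t = X @ w                           # X scores for this factor
        p = X.T @ t / (t @ t)               # X loadings
        qi = y @ t / (t @ t)                # regression of y on the scores
        X -= np.outer(t, p)                 # deflate X ...
        y -= qi * t                         # ... and y before extracting the next factor
        W[:, i], P[:, i], T[:, i], q[i] = w, p, t, qi
    B = W @ np.linalg.solve(P.T @ W, q)     # coefficients for the original (undeflated) X
    return B, T, W, P, q

# Usage on made-up centered and scaled data, extracting a = 2 factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(45, 81))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=45)
Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yc = (y - y.mean()) / y.std(ddof=1)
B, T, W, P, q = nipals_pls1(Xc, yc, a=2)
y_hat = Xc @ B                              # predicted (centered and scaled) response

Each pass extracts one weight vector and one score vector and then deflates, so every new factor works only with the variation that earlier factors left unexplained.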

Models in Terms of X Scores

Suppose that a factors are extracted. Then there are:

• a weight vectors, w1, w2, ..., wa

• a X-score vectors, t1, t2, ..., ta

• a Y-score vectors, u1, u2, ..., ua

We can now define three matrices: W is the m x a matrix whose columns consist of the weight vectors; T and U are the n x a matrices whose columns consist of the X-score and Y-score vectors, respectively. In NIPALS, the Y scores, ui, are regressed on the X scores, ti, in an inner relation regression fit.

Recall that the matrix Y contains k responses, so that Y is n x k. Let’s also assume that X and Y are both centered and scaled. For both NIPALS and SIMPLS, predictive models for both Y and X can be given in terms of a regression on the scores, T. Although we won’t go into the details at this point, we introduce notation for these predictive models:

(4.1) X̂ = TP'

Ŷ = TQ'

where P is m x a and Q is k x a. The matrix P is called the X loading matrix, and its columns are the scaled X loadings. The matrix Q is sometimes called the Y loading matrix. In NIPALS, its columns are proportional to the Y loading vectors. In SIMPLS, when Y contains more than one response, its representation in terms of loading vectors is more complex. Each matrix projects the observations onto the space defined by the factors. (See Appendix 1.) Each column is associated with a specific factor. For example, the ith column of P is associated with the ith extracted factor. The jth element of the ith column of P reflects the strength of the relationship between the jth predictor and the ith extracted factor. The columns of Q are interpreted similarly.

To facilitate the task of determining how much a predictor or response variable contributes to a factor, the loadings are usually scaled so that each loading vector has length one. This makes it easy to compare loadings across factors and across the variables in X and Y.

Model in Terms of Xs

Let’s continue to assume that the variables in the matrices X and Y are centered and scaled. We can consider the Ys to be related directly to the Xs in terms of a theoretical model as follows:

Y = Xβ + εY.

Here, β is an m x k matrix of regression coefficients. The estimate of the matrix β that is derived using PLS depends on the fitting algorithm. The details of the derivation are given in Appendix 1.

The NIPALS algorithm requires the use of a diagonal matrix, Δb, whose diagonal entries are defined by the inner relation mentioned earlier, where the Y scores, ui, are regressed on the X scores, ti. The estimate of β also involves a matrix, C, that contains the Y weights, also called the Y loadings. The column vectors of C define linear combinations of the deflated Y variables that have maximum covariance with linear combinations of the deflated X variables.

Using these matrices, in NIPALS, β is estimated by

(4.2) B = W(P'W)⁻¹ΔbC'

and Y is estimated in terms of X by

Ŷ = XB = XW(P'W)⁻¹ΔbC'.

The SIMPLS algorithm also requires a matrix of Y weights, also called Y loadings, that is computed in a different fashion than in NIPALS. Nevertheless, we call this matrix C. Then, for SIMPLS, β is estimated by

(4.3) B = WC'

and Y is estimated in terms of X by

Ŷ = XB = XWC'.

Properties

Perhaps the most important property, shared by both NIPALS and SIMPLS, is that, subject to norm restrictions, both methods maximize the covariance between the X structure and the Y structure for each factor. The precise sense in which this property holds is one of the features that distinguishes NIPALS and SIMPLS. In NIPALS, the covariance is maximized for components defined on the residual matrices. In contrast, the maximization in SIMPLS applies directly to the centered and scaled X and Y matrices.

The scores, which form the basis for PLS modeling, are constructed from the weights. The weights are the vectors that define linear combinations of the Xs that maximize covariance with the Ys. Maximizing the covariance is directly related to maximizing the correlation. One can show that maximizing the covariance is equivalent to maximizing the product of the squared correlation between the X and Y structures, and the variance of the X structure. (See the section “Bias toward X Directions with High Variance” in Appendix 1, or Hastie et al. 2001.)
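
Written out for a pair of linear combinations Xw and Yc, the identity behind this statement is

\mathrm{Cov}^2(Xw, Yc) = \mathrm{Var}(Xw)\,\mathrm{Var}(Yc)\,\mathrm{Corr}^2(Xw, Yc)

In the single-response case, the variance of the Y term is fixed by the data, so maximizing the squared covariance amounts to maximizing the product of the variance of the X combination and the squared correlation.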

Recalling that correlation is a scale-invariant measure of linear relationship, this insight shows that the PLS model is pulled toward directions in X space that have high variability. In other words, the PLS model is biased away from directions in the X space with low variability. (This is illustrated in the section “PLS versus PCA”.) As the number of latent factors increases, the PLS model approaches the standard least squares model.

The vector of X scores, ti, represents the location of the rows of X projected onto the ith factor, wi. The entries of the X loading vector at the ith iteration are proportional to the correlations of the centered and scaled predictors with ti. So, the term loading refers to how the predictors relate to a given factor in terms of degree of correlation. Similarly, the entries of the Y loading vector at the ith iteration are proportional to the correlations of the centered and scaled responses with ti. JMP scales all loading vectors to have length one. Note that Y loadings are not of interest unless there are multiple responses. (See “Properties of the NIPALS Algorithm” in Appendix 1.)

It is also worth pointing out that the factors that define the linear surface onto which the X values are projected are orthogonal to each other. This has these advantages:

• Because they relate to independent directions, the scores are easy to interpret.

• If we were to fit two models, say, one with only one extracted factor and one with two, the single factor in the first model would be identical to the first factor in the second model. That is, as we add more factors to a PLS model, we do not disturb the ones we already have. This is a useful feature that is not shared by all projection-based methods; independent component analysis (Hastie et al. 2001) is an example of a technique that does not have this feature.

We detail properties associated with both fitting algorithms in Appendix 1.

Example

Now, to gain a deeper understanding of two of the basic elements in PLS, the scores and loadings, open the data table PLSScoresAndLoadings.jmp by clicking on the correct link in the master journal. This table contains two predictors, x1 and x2, and two responses, y1 and y2, as well as other columns that, as we shall see, have been saved as the result of a PLS analysis. The table also contains six scripts, which we run in order.

Run the first script, Scatterplot Matrix, to explore the relationships among the two predictors, x1 and x2, and the two responses, y1 and y2 (Figure 4.12). The scatterplot in the upper left shows that the predictors are strongly correlated, whereas the scatterplot in the lower right shows that the responses are not very highly correlated. (See the yellow cells in Figure 4.12.)

Figure 4.12: Scatterplots for All Four Variables


The ranges of values suggest that the variables have already been centered and scaled. To verify this, run the script X and Y are Centered and Scaled. This produces a summary table showing the means and standard deviations of the four variables.
