Introducing Partial Least Squares
Modeling in General
Applied statistics can be thought of as a body of knowledge, or even a technology, that supports learning about the real world in the face of uncertainty. The theme of learning is ubiquitous in virtually every context imaginable, and along with this comes the idea of a (statistical) model that tries to codify or encapsulate our current understanding.
Many statistical models can be thought of as relating one or more inputs (which we call collectively X) to one or more outputs (collectively Y). These quantities are measured on the items or units of interest, and models are constructed from these observations. Such observations yield data that are either inherently numeric or can be coded in numerical form.
By the standards of fundamental physics, chemistry, and biology, at least, statistical models are generally useful when current knowledge is moderately low and the underlying mechanisms that link the values in X and Y are obscure. So although one of the perennial challenges of any modeling activity is to take proper account of whatever is already known, the fact remains that statistical models are generally empirical in nature. This is not in any sense a failing, since there are many situations in research, engineering, the natural sciences, the physical sciences, life science, behavioral science, and other areas in which such empirical knowledge has practical utility or opens new, useful lines of inquiry.
However, along with this diversity of contexts comes a diversity of data. Whatever its intrinsic beauty, a useful model must be flexible enough to adequately support the more specific objectives of prediction from, or explanation of, the data presented to it. As we shall see, one of the appealing aspects of partial least squares as a modeling approach is that, unlike some more traditional approaches that might be familiar to you, it is able to encompass much of this diversity within a single framework.
A final comment on modeling in general: all data is contextual. Only you can determine the plausibility and relevance of the data that you have, and you overlook this simple fact at your peril. Although statistical modeling can be invaluable, just looking at the data in the right way can and should illuminate and guide the specifics of building empirical statistical models of any kind (Chatfield 1995).
Partial Least Squares in Today’s World
Increasingly, we are finding data everywhere. This data explosion, supported by innovative and convergent technologies, has arguably made data exploration (e-Science) a fourth learning paradigm, joining theory, experimentation, and simulation as a way to drive new understanding (Microsoft Research 2009).
In simple retail businesses, sellers and buyers are wrestling for more leverage over the selling/buying process, and are attempting to make better use of data in this struggle. Laboratories, production lines, and even cars are increasingly equipped with relatively low-cost instrumentation routinely producing data of a volume and complexity that was difficult to foresee even thirty years ago. This book shows you how partial least squares, with its appealing flexibility, fits into this exciting picture.
This abundance of data, supported by the widespread use of automated test equipment, results in data sets with a large number of columns, or variables, v, and/or a large number of observations, or rows, n. Often, but not always, it is cheap to increase v and expensive to increase n.
When the interpretation of the data permits a natural separation of variables into predictors and responses, partial least squares, or PLS for short, is a flexible approach to building statistical models for prediction. PLS can deal effectively with the following (a brief illustration appears after this list):
• Wide data (when v >> n, and v is large or very large)
• Tall data (when n >> v, and n is large or very large)
• Square data (when n ~ v, and n is large or very large)
• Collinear variables, namely, variables that convey the same, or nearly the same, information
• Noisy data
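To make the wide-data case concrete, here is a minimal sketch, not part of the book's JMP workflow, using scikit-learn's PLSRegression in Python; the data, dimensions, and number of factors are all invented for illustration.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    n, v = 20, 200                       # 20 rows but 200 predictors: wide data
    X = rng.normal(size=(n, v))
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=n)  # y driven by a few Xs

    # Ordinary least squares is underdetermined when v >> n; PLS copes by
    # projecting X onto a small number of latent factors.
    pls = PLSRegression(n_components=3)  # n_components plays the role of "factors"
    pls.fit(X, y)
    print(pls.score(X, y))               # R-squared on the training data

Despite there being ten times more predictors than rows, the model fits without complaint, which is precisely the situation where ordinary regression breaks down.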
Just to whet your appetite, we point out that PLS routinely finds application in the following disciplines as a way of taming multivariate data:
• Psychology
• Education
• Economics
• Political science
• Environmental science
• Marketing
• Engineering
• Chemistry (organic, analytical, medical, and computational)
• Bioinformatics
• Ecology
• Biology
• Manufacturing
Transforming, and Centering and Scaling Data
Data should always be screened for outliers and anomalies prior to any formal analysis, and PLS is no exception. In fact, PLS works best when the variables involved have somewhat symmetric distributions. For that reason, for example, highly skewed variables are often logarithmically transformed prior to any analysis.
Also, the data are usually centered and scaled prior to conducting the PLS analysis. By centering, we mean that, for each variable, the mean of all its observations is subtracted from each observation. By scaling, we mean that each observation is divided by the variable’s standard deviation. Centering and scaling each variable results in a working data table where each variable has mean 0 and standard deviation 1.
Centering and scaling are important because the weights that form the basis of the PLS model are very sensitive to the measurement units of the variables. Without centering and scaling, variables with higher variance have more influence on the model. The process of centering and scaling puts all variables on an equal footing. If certain variables in X are indeed more important than others, and you want them to have higher influence, you can accomplish this by assigning them a higher scaling weight (Eriksson et al. 2006). As you will see, JMP makes centering and scaling easy.
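As a minimal sketch of this preprocessing (using invented, deliberately skewed data rather than anything from the book), the following Python fragment log-transforms the variables and then centers and scales each column to mean 0 and standard deviation 1:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 3))  # right-skewed columns

    X = np.log(X)                                  # tame the skew first
    X_centered = X - X.mean(axis=0)                # centering: subtract each column mean
    X_scaled = X_centered / X.std(axis=0, ddof=1)  # scaling: divide by each std deviation

    print(X_scaled.mean(axis=0).round(10))         # effectively 0 for every column
    print(X_scaled.std(axis=0, ddof=1).round(10))  # exactly 1 for every column

In JMP this happens behind the scenes when the Centering and Scaling options are checked, so you rarely need to do it by hand.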
Later we discuss how PLS relates to other modeling and multivariate methods. But for now, let’s dive into an example so that we can compare and contrast it with the more familiar multiple linear regression (MLR).
An Example of a PLS Analysis
The Data and the Goal
The data table Spearheads.jmp contains data relating to the chemical composition of spearheads known to originate from one of two African tribes (Figure 1.1). You can open this table by clicking on the correct link in the master journal. A total of 19 spearheads of known origin were studied. The Tribe of origin is recorded in the first column (“Tribe A” or “Tribe B”). Chemical measurements of 10 properties were made. These are given in the subsequent columns and are represented in the Columns panel in a column group called Xs. There is a final column called Set, indicating whether an observation will be used in building our model (“Training”) or in assessing that model (“Test”).
Figure 1.1: The Spearheads.jmp Data Table
Our goal is to build a model that uses the chemical measurements to help us decide whether other spearheads collected in the vicinity were made by “Tribe A” or “Tribe B”. Note that there are 10 columns in X (the chemical compositions) and only one column in Y (the attribution of the tribe).
The model will be built using the training set, rows 1–9. The test set, rows 10–19, enables us to assess the ability of the model to predict the tribe of origin for newly discovered spearheads. The column Tribe actually contains the numerical values +1 and –1, with –1 representing “Tribe A” and +1 representing “Tribe B”. The Tribe column displays Value Labels for these numerical values. It is the numerical values that the model actually predicts from the chemical measurements.
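The same setup is easy to mimic outside JMP. Here is a hedged Python sketch with scikit-learn, where the measurements and tribe labels are randomly generated stand-ins for the Spearheads data:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(7)
    X = rng.normal(size=(19, 10))                 # 19 spearheads, 10 chemical measurements
    tribe = rng.choice(["Tribe A", "Tribe B"], size=19)
    y = np.where(tribe == "Tribe B", 1.0, -1.0)   # -1 = Tribe A, +1 = Tribe B

    train, test = np.arange(9), np.arange(9, 19)  # rows 1-9 train, rows 10-19 test
    model = PLSRegression(n_components=3).fit(X[train], y[train])
    pred_test = model.predict(X[test]).ravel()    # predicted scores for the held-out rows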
The table Spearheads.jmp also contains four scripts that help us perform the PLS analysis quickly. In the later chapters containing examples, we walk through the menu options that enable you to conduct such an analysis. But, for now, the scripts expedite the analysis, permitting us to focus on the concepts underlying a PLS analysis.
The Analysis
The first script, Fit Model Launch Window, located in the upper left of the data table as shown in Figure 1.2, enables us to set up the analysis we want. From the red-triangle menu, shown in Figure 1.2, select Run Script. This script only runs if you are using JMP Pro since it uses the Fit Model partial least squares personality. If you are using JMP, you can select Analyze > Multivariate Methods > Partial Least Squares from the JMP menu bar. You will be able to follow the text, but with minor modifications.
Figure 1.2: Running the Script “Fit Model Launch Window”
This script produces a populated Fit Model launch window (Figure 1.3). The column Tribe is entered as a response, Y, while the 10 columns representing metal composition measurements are entered as Model Effects. Note that the Personality is set to Partial Least Squares. In JMP Pro, you can access this launch window directly by selecting Analyze > Fit Model from the JMP menu bar.
Below the Personality drop-down menu, shown in Figure 1.3, there are check boxes for Centering and Scaling. As mentioned in the previous section, centering and scaling all variables in a PLS analysis places them on an equal footing. There is also a check box for Standardize X. This option, described in “The Standardize X Option” in Appendix 1, centers and scales columns that are involved in higher-order terms. JMP selects these three options by default.
Figure 1.3: Populated Fit Model Launch Window
Clicking Run brings us to the Partial Least Squares Model Launch control panel (Figure 1.4). Here, we can make choices about how we would like to fit the model. Note that we are allowed to choose between two fitting algorithms to be discussed later: NIPALS and SIMPLS. We accept the default settings. (To reproduce the exact analysis shown below, select Set Random Seed from the red triangle menu at the top of the report and enter 111.) Click Go. (You can, instead, run the script PLS Fit to see the report.)
Figure 1.4: PLS Model Launch Control Panel
This appends three new report sections, as shown in Figure 1.5: Model Comparison Summary, KFold Cross Validation with K=7 and Method=NIPALS, and NIPALS Fit with 3 Factors. Later, we fully explain the various options and report contents, but for now we take the analysis on trust in order to quickly see this example in its entirety. As we discuss later, the Number of Factors is a key aspect of a PLS model. The report in Figure 1.5 shows 3 Factors, but your report might show a different number. This is because the Validation Method of KFold, set as a default in the JMP Pro Model Launch control panel, involves an element of randomness.
Figure 1.5: Initial PLS Reports
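JMP’s KFold validation chooses the Number of Factors for you. As a rough analog (not JMP’s exact procedure), the following Python sketch scores candidate factor counts by 7-fold cross validation on invented data; note the shuffle, which is the source of the run-to-run randomness mentioned above:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(111)
    X = rng.normal(size=(19, 10))
    y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=19)  # toy response

    cv = KFold(n_splits=7, shuffle=True, random_state=111)  # K = 7, as in the report
    scores = {k: cross_val_score(PLSRegression(n_components=k), X, y, cv=cv).mean()
              for k in range(1, 6)}
    best = max(scores, key=scores.get)  # cross-validated choice of the number of factors
    print(scores, best)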
Once you have built a model in JMP, you can save the prediction formula to the table containing the data that were analyzed. We do this for our PLS model. From the red-triangle menu for the NIPALS Fit with 3 Factors report, select Save Columns > Save Prediction Formula (Figure 1.6).
Figure 1.6: Saving the Prediction Formula
The saved formula column, Pred Formula Tribe, appears as the last column in the data table. Because we are actually saving a formula, we obtain predicted values for all 19 rows.
Testing the Model
To see how well our PLS model has performed, let’s simulate the arrival of new data using our test set. We would like to remove the Hide and Exclude row states from rows 10–19, and apply them to rows 1–9. You can do this by hand, or by running the script Toggle Hidden/Excluded Rows. To do this by hand, select Rows > Clear Row States, select rows 1–9, right-click in the highlighted area near the row numbers, and select Hide and Exclude. (In versions of JMP prior to JMP 11, select Exclude/Unexclude, and then right-click again and select Hide/Unhide.)
Now run the script Predicted vs Actual Tribe. For each row, this plots the predicted score for tribal origin on the vertical axis against the actual tribe of origin on the horizontal axis (Figure 1.7).
Figure 1.7: Predicted versus Actual Tribe for Test Data
To produce this plot yourself, select Graph > Graph Builder. In the Variables panel, right-click on the modeling type icon to the left of Tribe and select Nominal. (This causes the value labels for Tribe to display.) Drag Tribe to the X area and Pred Formula Tribe to the Y area.
Note that the predicted values are not exactly +1 or –1, so it makes sense to use a decision boundary (the dotted blue line at the value 0) to separate or classify the scores produced by our model into two groups. You can insert a decision boundary by double-clicking on the vertical axis. This opens the Y Axis Specification window. In the Reference Lines section near the bottom of the window, click Add to add a reference line at 0, and then enter the text Decision Boundary in the Label text box.
The important finding conveyed by the graph is that our PLS model has performed admirably. The model has correctly classified all ten observations in the test set. All of the observations for “Tribe A” have predicted values below 0 and all those for “Tribe B” have predicted values above 0.
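The decision rule itself is just a sign check. Continuing the hypothetical sketch from earlier, with illustrative values standing in for the saved predictions, classifying the held-out scores takes one line:

    import numpy as np

    pred_test = np.array([-0.8, -0.4, 0.3, 0.9])            # illustrative predicted scores
    labels = np.where(pred_test < 0, "Tribe A", "Tribe B")  # below 0 -> Tribe A, above -> Tribe B
    print(labels)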
Our model for the spearhead data was built using only nine spearheads, one less than the number of chemical measurements made. PLS provides an excellent classification model in this case.
Before exploring PLS in more detail, let’s engage in a quick review of multiple linear regression. This is a common approach to modeling a single variable in Y using a collection of variables, X.