
2 A Review of Multiple Linear Regression

Contents

The Cars Example

Estimating the Coefficients

Underfitting and Overfitting: A Simulation

The Effect of Correlation among Predictors: A Simulation

The Cars Example

Consider Figure 2.1, which displays the data table CarsSmall.jmp. You can open this table by clicking on the correct link in the master journal. This data table consists of six rows, corresponding to specific cars of different types, and six variables from the JMP sample data table Cars.jmp.

Figure 2.1: Data Table CarsSmall.jmp


The first column, Automobile, is an identifier column. Our goal is to predict miles per gallon (MPG) from the other descriptive variables. So, in this context, the variable MPG is the single variable in Y, and X consists of the four variables Number of Cylinders, HP (horsepower), Weight, and Transmission (with values “Man” and “Auto”, for manual and automatic transmissions, respectively).

This data structure is typical of the type of data to which multiple linear regression (MLR), or more generally, any modeling approach, is applied. This familiar tabular structure leads naturally to the representation and manipulation of data values as matrices.

To be more specific, a multiple linear regression model for our data can be represented as shown here:

(2.1)
$$
\begin{pmatrix} 21.0 \\ 22.8 \\ 18.7 \\ 18.1 \\ 14.3 \\ 24.4 \end{pmatrix}
=
\begin{bmatrix}
1 & 6 & 110 & 2.62 & \text{Man } (0) \\
1 & 4 & 93  & 2.32 & \text{Man } (0) \\
1 & 8 & 175 & 3.44 & \text{Auto } (1) \\
1 & 6 & 105 & 3.46 & \text{Auto } (1) \\
1 & 8 & 245 & 3.57 & \text{Auto } (1) \\
1 & 4 & 62  & 3.19 & \text{Auto } (1)
\end{bmatrix}
*
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
$$

Here are various items to note:

1. The values of the response, MPG, are presented on the left side of the equality sign, in the form of a column vector, which is a special type of matrix that contains only a single column. In our example, this is the only column in the response matrix Y.

2. The rectangular array to the immediate right of the equality sign, delineated by square brackets, consists of five columns. There is a column of ones followed by four columns consisting of the values of our four predictors, Number of Cylinders, HP (horsepower), Weight, and Transmission. These five columns are the columns in the matrix X.

3. In parentheses, next to the entries in the last column of X, the Transmission value labels “Man” and “Auto” have been assigned the numerical values 0 and 1, respectively. Because matrices can contain only numeric data, the values of the variable Transmission have to be coded in a numerical form. When a nominal variable is included in a regression model, JMP automatically codes that column, and you can interpret reports without ever knowing what has happened behind the scenes. But if you are curious, select Help > Books > Fitting Linear Models, and search for “Nominal Effects” and “Nominal Factors”.

4. The column vector consisting of βs, denoted β, contains the unknown coefficients that relate the entries in X to the entries in Y. These are usually called regression parameters.

5. The column vector consisting of epsilons (εi), denoted ε, contains the unknown errors. This vector represents the variation that is unexplained when we model Y using X.

The symbol “*” in Equation (2.1) denotes matrix multiplication. The expanded version of Equation (2.1) is:

(2.2)
$$
\begin{pmatrix} 21.0 \\ 22.8 \\ 18.7 \\ 18.1 \\ 14.3 \\ 24.4 \end{pmatrix}
=
\begin{pmatrix}
\beta_0 + 6\beta_1 + 110\beta_2 + 2.62\beta_3 + \varepsilon_1 \\
\beta_0 + 4\beta_1 + 93\beta_2 + 2.32\beta_3 + \varepsilon_2 \\
\beta_0 + 8\beta_1 + 175\beta_2 + 3.44\beta_3 + \beta_4 + \varepsilon_3 \\
\beta_0 + 6\beta_1 + 105\beta_2 + 3.46\beta_3 + \beta_4 + \varepsilon_4 \\
\beta_0 + 8\beta_1 + 245\beta_2 + 3.57\beta_3 + \beta_4 + \varepsilon_5 \\
\beta_0 + 4\beta_1 + 62\beta_2 + 3.19\beta_3 + \beta_4 + \varepsilon_6
\end{pmatrix}
$$

Equation (2.2) indicates that each response is to be modeled as a linear function of the unknown βs.
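To make this concrete, here is a minimal sketch in Python with NumPy (rather than JSL) that builds the response vector Y and the design matrix X for the six CarsSmall.jmp rows, writing out by hand the column of ones and the 0/1 coding for Transmission shown in Equation (2.1):

```python
import numpy as np

# Response vector Y: MPG for the six cars in CarsSmall.jmp
Y = np.array([21.0, 22.8, 18.7, 18.1, 14.3, 24.4]).reshape(-1, 1)

# Design matrix X: a column of ones (intercept), Number of Cylinders, HP,
# Weight, and Transmission coded as 0 = Man, 1 = Auto, as in Equation (2.1)
X = np.array([
    [1, 6, 110, 2.62, 0],
    [1, 4,  93, 2.32, 0],
    [1, 8, 175, 3.44, 1],
    [1, 6, 105, 3.46, 1],
    [1, 8, 245, 3.57, 1],
    [1, 4,  62, 3.19, 1],
], dtype=float)

print(Y.shape, X.shape)  # (6, 1) (6, 5): a 6 by 1 response and a 6 by 5 design matrix
```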

We can represent Equation (2.1) more generically as:

(2.3)
$$
\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix}
=
\begin{bmatrix}
X_{10} & X_{11} & X_{12} & X_{13} & X_{14} \\
X_{20} & X_{21} & X_{22} & X_{23} & X_{24} \\
X_{30} & X_{31} & X_{32} & X_{33} & X_{34} \\
X_{40} & X_{41} & X_{42} & X_{43} & X_{44} \\
X_{50} & X_{51} & X_{52} & X_{53} & X_{54} \\
X_{60} & X_{61} & X_{62} & X_{63} & X_{64}
\end{bmatrix}
*
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
$$

Now we can write Equation (2.3) succinctly as:

(2.4) Y = Xβ + ε,

where

$$
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \end{pmatrix}, \quad
X = \begin{bmatrix}
X_{10} & X_{11} & X_{12} & X_{13} & X_{14} \\
X_{20} & X_{21} & X_{22} & X_{23} & X_{24} \\
X_{30} & X_{31} & X_{32} & X_{33} & X_{34} \\
X_{40} & X_{41} & X_{42} & X_{43} & X_{44} \\
X_{50} & X_{51} & X_{52} & X_{53} & X_{54} \\
X_{60} & X_{61} & X_{62} & X_{63} & X_{64}
\end{bmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}, \quad \text{and} \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \end{pmatrix}
$$

For a column vector like Y, we need only one index to designate the row in which an element occurs. For the 6 by 5 matrix X, we require two indices. The first designates the row and the second designates the column. Note that we have not specified the matrix multiplication operator in Equation (2.4); it is implied by the juxtaposition of any two matrices.

Equation (2.4) enables us to note the following:

1. The entries in X consist of the column of ones followed by the observed data on each of the four predictors.

2. Even though the entries in X are observational data, rather than the result of a designed experiment, the matrix X is still called the design matrix.

3. The vector ε, which contains the errors, εi, is often referred to as the noise.

4. Once we have estimated the column vector β, we are able to obtain predicted values of MPG. By comparing these predicted values to their actual values, we obtain estimates of the errors, εi. These differences, namely the actual minus the predicted values, are called residuals.

5. If the model provides a good fit, we expect the residuals to be small, in some sense. We also expect them to show a somewhat random pattern, indicating that our model adequately captures the structural relationship between X and Y. If the residuals show a structured pattern, one remedy might be to specify a more complex model by adding additional columns to X; for example, columns that define interaction terms and/or power terms (Draper and Smith 1998).

6. The values in β are the coefficients or parameters that correspond to each column or term in the design matrix X (including the first, constant term). In terms of this linear model, their interpretation is straightforward. For example, β3 is the expected change in MPG for a unit change in Weight, with the other predictors held fixed.

7. Note that the dimensions of the matrices (number of rows and columns) have to conform in Equation (2.4). In our example, Y is a 6 by 1 matrix, X is a 6 by 5 matrix, β is a 5 by 1 matrix, and ε is a 6 by 1 matrix.

Estimating the Coefficients

So how do we calculate β from the data we have collected? There are numerous approaches, partly depending on the assumptions you are prepared or required to make about the noise component, ε. It is generally assumed that the X variables are measured without noise, so that the noise is associated only with the measurement of the response, Y.

It is also generally assumed that the errors, εi, are identically and independently distributed according to a normal distribution (Draper and Smith 1998). Once a model is fit to the data, your next step should be to check whether the pattern of residuals is consistent with this assumption.

For a full introduction to MLR in JMP using the Fit Model platform, select Help > Books > Fitting Linear Models. When the PDF opens, go to the chapter entitled “Standard Least Squares Report and Options.”

More generally, returning to the point about matrix dimensions, the dimensions of the components of a regression model of the form

(2.5) Y = Xβ + ε

can be represented as follows:

• Y is an n x 1 response matrix.

• X is an n x m design matrix.

• β is an m x 1 coefficient vector.

• ε is an n x 1 error vector.

Here n is the number of observations and m is the number of columns in X. For now, we assume that there is only one column in Y, but later on, we consider situations where Y has multiple columns.

Let’s pause for a quick linear algebra review. If A is any r x s matrix with elements aij, then the matrix A', with elements aji, is called the transpose of A. Note that the rows of A are the columns of A'. We denote the inverse of a square q x q matrix B by B⁻¹. By definition, BB⁻¹ = B⁻¹B = I, where I is the q x q identity matrix (with 1s down the leading diagonal and 0s elsewhere). If the columns of an arbitrary matrix, say A, are linearly independent, then it can be shown that the inverse of the matrix A'A exists.
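As a quick numerical check of these definitions, the short Python/NumPy sketch below (the matrices A and B are arbitrary illustrative values, not taken from the book) forms a transpose and an inverse and confirms that A'A is invertible when the columns of A are linearly independent:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 7.0]])            # a 3 x 2 matrix with linearly independent columns
print(A.T)                            # the transpose A': rows of A become columns of A'

B = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a square 2 x 2 matrix
B_inv = np.linalg.inv(B)
print(np.allclose(B @ B_inv, np.eye(2)))  # True: B times its inverse is the identity I

AtA = A.T @ A                         # A'A is 2 x 2; invertible because A's columns are independent
print(np.linalg.inv(AtA))
```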

In MLR, when n ≥ m and when the columns of X are linearly independent so that the matrix (X'X)⁻¹ exists, the coefficients in β can be estimated in a unique fashion as:

β̂ = (X'X)⁻¹X'Y

The hat above β in the notation β̂ indicates that this vector contains numerical estimates of the unknown coefficients in β. If there are fewer observations than columns in X, that is, if n < m, then there are infinitely many solutions for β in Equation (2.5).

As an example, think of trying to fit two observations with a matrix X that has three columns. Then, geometrically, the expression in Equation (2.5) defines a hyperplane which, given that m = 3 in this case, is simply a plane. But there are infinitely many planes that pass through any two given points. There is no way to determine which of these infinitely many solutions would be best at predicting new observations well.
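For illustration only (this is not the JMP Fit Model platform), the formula above can be applied directly to the Y and X assembled earlier for CarsSmall.jmp. The sketch below computes β̂ = (X'X)⁻¹X'Y and, as a numerically safer alternative, the same estimate via a least-squares solver:

```python
import numpy as np

Y = np.array([21.0, 22.8, 18.7, 18.1, 14.3, 24.4]).reshape(-1, 1)
X = np.array([
    [1, 6, 110, 2.62, 0],
    [1, 4,  93, 2.32, 0],
    [1, 8, 175, 3.44, 1],
    [1, 6, 105, 3.46, 1],
    [1, 8, 245, 3.57, 1],
    [1, 4,  62, 3.19, 1],
], dtype=float)

# Normal-equations estimate: beta_hat = (X'X)^(-1) X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# In practice a least-squares solver is preferred over forming an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

residuals = Y - X @ beta_hat          # actual minus predicted values, i.e., the residuals
print(beta_hat.ravel())
print(beta_lstsq.ravel())             # the least-squares solver gives the same estimates
```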

Underfitting and Overfitting: A Simulation

To better understand the issues behind model fitting, let’s run the script PolyRegr.jsl by clicking on the correct link in the master journal.

The script randomly generates Y values for eleven points with X values plotted horizontally and equally spaced over the interval 0 to 1 (Figure 2.2). The points exhibit some curvature. The script uses MLR to predict Y from X, using various polynomial models. (Note that your points will differ from ours because of the randomness.)

When the slider in the bottom left corner is set at the far left, the order of the polynomial model is one. In other words, we are fitting the data with a line. In this case, the design matrix X has two columns, the first containing all 1s and the second containing the horizontal coordinates of the plotted points. The linear fit ignores the seemingly obvious pattern in the data; it is underfitting the data. This is evidenced by the residuals, whose magnitudes are illustrated using vertical blue lines. The RMSE (root mean square error) is calculated by squaring each residual, averaging these (specifically, dividing their sum by the number of observations minus one, minus the number of predictors), and then taking the square root.

As we shift the slider to the right, we are adding higher-order polynomial terms to the model. This is equivalent to adding additional columns to the design matrix. The additional polynomial terms provide a more flexible model that is better able to capture the important characteristics, or the structure, of the data.

Figure 2.2: Illustration of Underfitting and Overfitting, with Order = 1, 2, 3, and 10


However, we reach a point where we go beyond modeling the structure of the data and begin to model the noise in the data. Note that, as we increase the order of the polynomial, thereby adding more terms to the model, the RMSE progressively decreases. An order 10 polynomial, obtained by setting the slider all the way to the right, provides a perfect fit to the data and gives RMSE = 0 (bottom right plot in Figure 2.2). However, this model is not generalizable to new data, because it has modeled both the structure and the noise, and by definition the noise is random and unpredictable. Our model has overfit the data.
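Although PolyRegr.jsl is a JSL script, the behavior it demonstrates is easy to mimic. The rough Python/NumPy analogue below (the quadratic “true” curve, the noise level, and the random seed are invented for illustration) fits polynomials of increasing order to eleven equally spaced points and reports the RMSE, which drops to zero at order 10:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 11)                          # eleven equally spaced X values on [0, 1]
y = 2 + 3 * x - 4 * x**2 + rng.normal(0, 0.2, 11)  # curved structure plus random noise

for order in (1, 2, 3, 10):
    coefs = np.polyfit(x, y, order)                # MLR on polynomial terms of x
    resid = y - np.polyval(coefs, x)               # residuals: actual minus predicted
    dof = len(x) - (order + 1)                     # observations minus estimated parameters
    rmse = np.sqrt(np.sum(resid**2) / dof) if dof > 0 else 0.0
    print(f"order {order:2d}: RMSE = {rmse:.3f}")  # RMSE shrinks as terms are added
```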

In fitting models, we must strike a balance between modeling the intrinsic structure of the data and modeling the noise in the data. One strategy for reaching this goal is the use of cross-validation, which we shall discuss in the section “Choosing the Number of Factors” in Chapter 4. You can close the report produced by PolyRegr.jsl at this point.

The Effect of Correlation among Predictors: A Simulation

In MLR, correlation among the predictors is called multicollinearity. We explore the effect of multicollinearity on estimates of the regression coefficients by running the script Multicollinearity.jsl. Do this by clicking on the correct link in the master journal. The script produces the launch window shown in Figure 2.3.

Figure 2.3: Multicollinearity Simulation Launch Window


The launch window enables you to set conditions to simulate data from a known model:

• You can set the values of the three regression coefficients: Beta0 (constant); Beta1 (X1 coefficient); and Beta2 (X2 coefficient). Because there are three regression parameters, you are defining a plane that models the mean of the response, Y. In symbols,

E[Y] = β0 + β1X1 + β2X2

where the notation E[Y] represents the expected value of Y.

• The noise that is applied to Y is generated from a normal distribution with mean 0 and with the standard deviation that you set as Sigma of Random Noise under Other Parameters. In symbols, this means that ε in the expression

Y = β0 + β1X1 + β2X2 + ε

has a normal distribution with mean 0 and standard deviation equal to the value you set.

• You can specify the correlation between the values of X1 and X2 using the slider for Correlation of X1 and X2 under Other Parameters. X1 and X2 values will be generated for each simulation from a multivariate normal distribution with the specified correlation.

• In the Size of Simulation panel, you can specify the Number of Points to be generated for each simulation, as well as the Number of Simulations to run.

Once you have set values for the simulation using the slider bars, generate results by clicking Simulate. Depending on your screen size, you can view multiple results simultaneously without closing the launch window.
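For readers who prefer to experiment outside JMP, the following Python/NumPy sketch mimics the spirit of Multicollinearity.jsl; the coefficient values, noise sigma, correlation, number of points, and number of simulations are arbitrary stand-ins for the launch window settings. It generates correlated predictors, simulates Y, and collects the MLR estimates of β1 and β2 across simulations:

```python
import numpy as np

rng = np.random.default_rng(123)

beta0, beta1, beta2 = 1.0, 2.0, 3.0      # true regression coefficients (illustrative values)
sigma = 1.0                               # sigma of the random noise
corr = 0.92                               # correlation of X1 and X2
n_points, n_sims = 20, 500                # points per simulation, number of simulations

cov = np.array([[1.0, corr],
                [corr, 1.0]])
estimates = np.empty((n_sims, 2))

for s in range(n_sims):
    X12 = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_points)
    X = np.column_stack([np.ones(n_points), X12])            # intercept, X1, X2
    y = beta0 + beta1 * X12[:, 0] + beta2 * X12[:, 1] + rng.normal(0, sigma, n_points)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)                 # MLR fit for this simulation
    estimates[s] = b[1:]                                      # keep the estimates of beta1, beta2

# With high correlation the estimates scatter widely and are negatively correlated with each other
print(np.std(estimates, axis=0))
print(np.corrcoef(estimates.T)[0, 1])
```

Re-running the sketch with corr near zero shows the estimates clustering tightly around the true values, mirroring the two scenarios compared in Figures 2.4 and 2.5.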

Let’s first run a simulation with the initial settings. Then run a second simulation after moving the Correlation of X1 and X2 slider to a large, positive value. (We have selected 0.92.) Your reports will be similar to those shown in Figure 2.4.

Figure 2.4: Comparison of Design Settings, Low and High Predictor Correlation


The two graphs reflect the differences in the settings of the X variables for the two correlation scenarios. In the first, the points are evenly distributed in a circular pattern. In the second, the points are condensed into a narrow elliptical pattern. These patterns show the geometry of the design matrix for each scenario.

In the high correlation scenario, note that high values of X1 tend to be associated with high values of X2, and that low values of X1 tend to be associated with low values of X2. This is exactly what is expected for positive correlation. (For a definition of the correlation coefficient between observations, select Help > Books > Multivariate Methods and search for “Pearson Product-Moment Correlation”).

The true and estimated coefficient values are shown at the bottom of each plot. Because our model was not deterministic—the Y values were generated so that their means are linear functions of X1 and X2, but the actual values are affected by noise—the estimated coefficients are just that, estimates, and as such, they reflect uncertainty. This uncertainty is quantified in the columns Std Error (standard error), Lower95% (the lower 95% limit for the coefficient’s confidence interval), and Upper95% (the upper 95% limit for the coefficient’s confidence interval).

Notice that the estimates of beta1 and beta2 can be quite different from the true values in the high correlation scenario (bottom plot in Figure 2.4). Consistent with this, the standard errors are larger and the confidence intervals are wider.

Let’s get more insight into the impact that changing the correlation value has on the estimates of the coefficients. Increase the Number of Simulations to about 500 using the slider, and again simulate with two different values of the correlation, one near zero and one near one. You should obtain results similar to those in Figure 2.5.

Figure 2.5: Plots of Estimates for Coefficients, Low and High Predictor Correlation


These plots show Estimate beta1 and Estimate beta2, the estimated values of β1 and β2, from the 500 or so regression fits. The reference lines show the true values of the corresponding parameters, so that the intersection of these two lines shows the pair of true values used to simulate the data. In an ideal world, all of the estimate pairs would be very close to the point defined by the true values.

When X1 and X2 have correlation close to zero, the parameter estimates cluster rather uniformly around the true values. However, the impact of high correlation between X1 and X2 is quite dramatic. As this correlation increases, the estimates of β1 and β2 become much more variable, but also more strongly (and negatively) correlated with each other. As the correlation between the two predictors X1 and X2 approaches +1.0 or –1.0, we say that the X’X matrix involved in the MLR solution becomes ill-conditioned (Belsley 1991).

In fact, when there is perfect correlation between X1 and X2, the MLR coefficient estimates cannot be computed because the matrix (X’X)⁻¹ does not exist. The situation is similar to trying to build a regression model for MPG with two redundant variables, say, “Weight of Car in Kilograms” and “Weight of Car in Pounds.” Because the two predictors are redundant, there really is only a single predictor, and the MLR algorithm doesn’t know where to place its coefficients. There are infinitely many ways that the coefficients can be allocated to the two terms to produce the same model.
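A tiny numerical illustration of this redundancy (hypothetical weights and made-up MPG values, in Python/NumPy rather than JSL) shows that the design matrix loses a rank when one predictor is an exact multiple of another, so a unique least-squares solution no longer exists:

```python
import numpy as np

kg = np.array([1200.0, 1500.0, 1100.0, 1800.0])    # hypothetical car weights in kilograms
lb = kg * 2.20462                                   # the same weights expressed in pounds
X = np.column_stack([np.ones(4), kg, lb])           # intercept plus two redundant predictors
y = np.array([21.0, 18.0, 23.0, 15.0])              # made-up MPG values

print(np.linalg.matrix_rank(X))                     # 2, not 3: the columns are linearly dependent
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)   # also 2: lstsq returns just one of infinitely many equivalent coefficient vectors
```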

In cases of multicollinearity, the coefficient estimates are highly variable, as you see in Figure 2.5. This means that estimates have high standard errors, so that confidence intervals for the parameters are wide. Also, hypothesis tests can be ineffective because of the uncertainty inherent in the parameter estimates. Much research has been devoted to detecting multicollinearity and dealing with its consequences. Ridge regression and the lasso method (Hastie et al. 2001) are examples of regularization techniques that can be useful when high multicollinearity is present. (In JMP Pro 11, select Help > Books > Fitting Linear Models and search for “Generalized Regression Models”.)

Whether multicollinearity is of concern depends on your modeling objective. Are you interested in explaining or predicting? Multicollinearity is more troublesome for explanatory models, where the goal is to figure out which predictors have an important effect on the response. This is because the parameter estimates have high variability, which negatively impacts any inference about the predictors. For prediction, the model is useful, subject to the general caveat that an empirical statistical model is good only for interpolation, rather than extrapolation. For example, in the correlated case shown in Figure 2.4, one would not be confident in making predictions when X1 = +1 and X2 = -1 because the model is not supported by any data in that region.

You can close the reports produced by Multicollinearity.jsl at this point.
