Handbook of Regression Analysis With Applications in R - Samprit Chatterjee
1.3.2 MEASURING THE STRENGTH OF THE REGRESSION RELATIONSHIP
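The least squares estimates possess an important property:

$$\sum_{i=1}^{n} (y_i - \bar{Y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2.$$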
This formula says that the variability in the target variable (the left side of the equation, termed the corrected total sum of squares) can be split into two mutually exclusive parts: the variability left over after doing the regression (the first term on the right side, the residual sum of squares), and the variability accounted for by doing the regression (the second term, the regression sum of squares). This immediately suggests the usefulness of $R^2$ as a measure of the strength of the regression relationship, where

$$R^2 = \frac{\sum_{i} (\hat{y}_i - \bar{Y})^2}{\sum_{i} (y_i - \bar{Y})^2} \equiv \frac{\text{Regression SS}}{\text{Corrected total SS}} = 1 - \frac{\text{Residual SS}}{\text{Corrected total SS}}.$$
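The $R^2$ value (also called the coefficient of determination) estimates the population proportion of variability in $y$ accounted for by the best linear combination of the predictors. Values closer to $1$ indicate a good deal of predictive power of the predictors for the target variable, while values closer to $0$ indicate little predictive power. An equivalent representation of $R^2$ is

$$R^2 = \operatorname{corr}(y_i, \hat{y}_i)^2,$$

where

$$\operatorname{corr}(y_i, \hat{y}_i) = \frac{\sum_{i} (y_i - \bar{Y})(\hat{y}_i - \bar{\hat{Y}})}{\sqrt{\sum_{i} (y_i - \bar{Y})^2 \, \sum_{i} (\hat{y}_i - \bar{\hat{Y}})^2}}$$

is the sample correlation coefficient between $y$ and $\hat{y}$, with $\bar{\hat{Y}}$ the mean of the fitted values (this correlation is called the multiple correlation coefficient). That is, $R^2$ is a direct measure of how similar the observed and fitted target values are.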
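The two representations of $R^2$ can be checked numerically. The following is a minimal sketch in plain Python (an assumption; the text itself works in R), using a small made-up data set with a single predictor and the closed-form least squares fit. The data and all variable names are hypothetical.

```python
import math

# Hypothetical toy data (not from the text): one predictor x, target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit with a single predictor (closed form).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]
fitted_bar = sum(fitted) / n

# The three sums of squares in the decomposition.
total_ss = sum((yi - y_bar) ** 2 for yi in y)               # corrected total SS
resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual SS
reg_ss = sum((fi - y_bar) ** 2 for fi in fitted)             # regression SS

# R^2 as the proportion of variability accounted for by the regression.
r_squared = reg_ss / total_ss

# Sample correlation between observed and fitted values.
num = sum((yi - y_bar) * (fi - fitted_bar) for yi, fi in zip(y, fitted))
den = math.sqrt(total_ss * sum((fi - fitted_bar) ** 2 for fi in fitted))
multiple_corr = num / den

print(f"total SS = {total_ss:.3f}, residual + regression SS = {resid_ss + reg_ss:.3f}")
print(f"R^2 = {r_squared:.3f}, corr(y, fitted)^2 = {multiple_corr ** 2:.3f}")
```

For this toy data the corrected total sum of squares is 6, split into a residual part of 2.4 and a regression part of 3.6, so both routes give $R^2 = 0.6$.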
It can be shown that $R^2$ is biased upwards as an estimate of the population proportion of variability accounted for by the regression. The adjusted $R^2$ corrects this bias, and equals

$$R_a^2 = R^2 - \frac{p}{n - p - 1}\,(1 - R^2). \tag{1.7}$$
It is apparent from (1.7) that unless $p$ is large relative to $n - p - 1$ (that is, unless the number of predictors is large relative to the sample size), $R^2$ and $R_a^2$ will be close to each other, and the choice of which to use is a minor concern. What is perhaps more interesting is the nature of $R_a^2$ as providing an explicit tradeoff between the strength of the fit (the first term, with larger $R^2$ corresponding to stronger fit and larger $R_a^2$) and the complexity of the model (the second term, with larger $p$ corresponding to more complexity and smaller $R_a^2$). This tradeoff of fidelity to the data versus simplicity will be important in the discussion of model selection in Section 2.3.1.
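The tradeoff can be made concrete with a small sketch. The function below implements the adjusted $R^2$ in the form $R^2 - \frac{p}{n-p-1}(1 - R^2)$, where $p$ is the number of predictors and $n$ the sample size; the function name and the example values are hypothetical, and plain Python is used here as an assumption rather than the R tooling of the text.

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjusted R^2: fit term R^2 minus a complexity penalty p/(n-p-1) * (1 - R^2)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the adjustment to be defined")
    return r_squared - (p / (n - p - 1)) * (1.0 - r_squared)


# Hypothetical values: the same R^2 and number of predictors, two sample sizes.
r2 = 0.60
small_sample = adjusted_r_squared(r2, n=20, p=10)   # p large relative to n - p - 1
large_sample = adjusted_r_squared(r2, n=200, p=10)  # p small relative to n - p - 1

print(f"n = 20,  p = 10: adjusted R^2 = {small_sample:.4f}")
print(f"n = 200, p = 10: adjusted R^2 = {large_sample:.4f}")
```

With ten predictors and only twenty observations the penalty is substantial, while with two hundred observations the adjusted value stays close to the unadjusted $0.60$.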
The only parameter left unaccounted for in the estimation scheme is the variance of the errors $\sigma^2$. An unbiased estimate is provided by the residual mean square,

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}. \tag{1.8}$$
This estimate has a direct, but often underappreciated, use in assessing the practical importance of the model. Does knowing $x_1, \ldots, x_p$ really say anything of value about $y$? This isn't a question that can be answered completely statistically; it requires knowledge and understanding of the data and the underlying random process (that is, it requires context). Recall that the model assumes that the errors are normally distributed with standard deviation $\sigma$. This means that, roughly speaking, 95% of the time an observed $y$ value falls within $\pm 2\sigma$ of the expected response

$$E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

$E(y)$ can be estimated for any given set of $x$ values using

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p,$$

while the square root of the residual mean square (1.8), termed the standard error of the estimate, provides an estimate of $\sigma$ that can be used in constructing this rough prediction interval $\hat{y} \pm 2\hat{\sigma}$.
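As an illustration of this rough interval, the sketch below computes the residual mean square, the standard error of the estimate, and the fitted value plus or minus two standard errors for a single-predictor fit. It is a minimal example in plain Python (an assumption; the text works in R), and the data, the new predictor value `x_new`, and all variable names are hypothetical.

```python
import math

# Hypothetical toy data (not from the text): one predictor x, target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)
p = 1  # number of predictors

# Closed-form least squares fit for a single predictor.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Residual mean square: residual SS divided by n - p - 1.
resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sigma2_hat = resid_ss / (n - p - 1)

# Standard error of the estimate: the square root of the residual mean square.
sigma_hat = math.sqrt(sigma2_hat)

# Rough prediction interval at a hypothetical new predictor value.
x_new = 3.5
y_hat = b0 + b1 * x_new
lower, upper = y_hat - 2 * sigma_hat, y_hat + 2 * sigma_hat

print(f"residual mean square = {sigma2_hat:.3f}, standard error = {sigma_hat:.3f}")
print(f"rough prediction interval at x = {x_new}: ({lower:.3f}, {upper:.3f})")
```

Whether an interval of this width is practically useful is exactly the contextual question raised above: the half-width $2\hat{\sigma}$ has to be judged against the scale on which the target variable matters.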