Preface to the Second Edition
The years since the first edition of this book appeared have been fast‐moving in the world of data analysis and statistics. Algorithmically‐based methods operating under the banner of machine learning, artificial intelligence, or data science have come to the forefront of public perceptions about how to analyze data, and more than a few pundits have predicted the demise of classic statistical modeling.
To paraphrase Mark Twain, we believe that reports of the (impending) death of statistical modeling in general, and regression modeling in particular, are exaggerated. The great advantage that statistical models have over “black box” algorithms is that in addition to effective prediction, their transparency also provides guidance about the actual underlying process (which is crucial for decision making), and affords the possibilities of making inferences and distinguishing real effects from random variation based on those models. There have been laudable attempts to encourage making machine learning algorithms interpretable in the ways regression models are (Rudin, 2019), but we believe that models based on statistical considerations and principles will have a place in the analyst's toolkit for a long time to come.
Of course, part of that usefulness comes from the ability to generalize regression models to more complex situations, and that is the thrust of the changes in this new edition. One thing that hasn't changed is the philosophy behind the book, and our recommendations on how it can be best used, and we encourage the reader to refer to the preface to the first edition for guidance on those points. There have been small changes to the original chapters, and broad descriptions of those chapters can also be found in the preface to the first edition. The five new chapters (Chapters 11, 13, 14, 15, and 16, with the former Chapter 11 on nonlinear regression moving to Chapter 12) expand greatly on the power and applicability of regression models beyond what was discussed in the first edition. For this reason many more references are provided in these chapters than in the earlier ones, since some of the material in those chapters is less established and less well‐known, with much of it still the subject of active research. In keeping with that, we do not spend much (or any) time on issues for which there still isn't necessarily a consensus in the statistical community, but point to books and monographs that can help the analyst get some perspective on that kind of material.
Chapter 11 discusses the modeling of time‐to‐event data, often referred to as survival data. The response variable measures the length of time until an event occurs, and a common complication is that sometimes it is only known that a response value is greater than some number; that is, it is right‐censored. This can naturally occur, for example, in a clinical trial in which subjects enter the study at varying times, and the event of interest has not occurred by the end of the trial. Analysis focuses on the survival function (the probability of surviving past a given time) and the hazard function (the instantaneous rate at which the event occurs at a given time, given survival to that time). Parametric models based on appropriate distributions like the Weibull or log‐logistic can be fit that take censoring into account. Semiparametric models like the Cox proportional hazards model (the most commonly used model) and the Buckley‐James estimator are also available, which weaken distributional assumptions. Modeling can be adapted to situations where event times are truncated, and also when there are covariates that change over the life of the subject.
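As a brief reminder of the two functions named in this summary, their standard definitions can be written as follows. The notation is an assumption made here for illustration (T for the event time, f for its density, x for a vector of covariates) and is not fixed by this preface.

```latex
S(t) = P(T > t), \qquad
h(t) = \lim_{\Delta \to 0^{+}} \frac{P(t \le T < t + \Delta \mid T \ge t)}{\Delta}
     = \frac{f(t)}{S(t)}
```

Under the same assumed notation, the Cox proportional hazards model leaves the baseline hazard unspecified and lets covariates act multiplicatively on it:

```latex
h(t \mid x) = h_0(t) \exp(x^{\top} \beta)
```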
Chapter 13 extends applications to data with multiple observations for each subject consistent with some structure from the underlying process. Such data can take the form of nested or clustered data (such as students all in one classroom) or longitudinal data (where a variable is measured at multiple times for each subject). In this situation ignoring that structure results in an induced correlation that reflects unmodeled differences between classrooms and subjects, respectively. Mixed effects models generalize analysis of variance (ANOVA) models and time series models to this more complicated situation. Models with linear effects based on Gaussian distributions can be generalized to nonlinear models, and also can be generalized to non‐Gaussian distributions through the use of generalized linear mixed effects models.
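The induced correlation can be made concrete with the simplest mixed effects model, a random intercept model. This is a sketch under assumed notation (observation i within group j, a single predictor, independent Gaussian random effect and error), not a formulation taken from the chapter itself.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2), \\
\mathrm{Corr}(y_{ij}, y_{i'j}) &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2},
\qquad i \neq i'.
\end{aligned}
```

Two observations from the same group share the term u_j, so they are correlated whenever the between-group variance is positive, while observations from different groups remain uncorrelated.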
Modern data applications can involve very large (even massive) numbers of predictors, which can cause major problems for standard regression methods. Best subsets regression (discussed in Chapter 2) does not scale well to very large numbers of predictors, and Chapter 14 discusses approaches that do. Forward stepwise regression, in which potential predictors are stepped in one at a time, is an alternative to best subsets that scales to massive data sets. A systematic approach to reducing the dimensionality of a chosen regression model is regularization, in which the usual estimation criterion is augmented with a penalty that encourages sparsity; the most commonly used version of this is the lasso estimator, and it and its generalizations are discussed further.
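For reference, the lasso augments the least squares criterion with a penalty on the absolute values of the slope coefficients. The form below is the standard one, written in assumed notation (n observations, p predictors, tuning parameter lambda) rather than notation taken from the chapter.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta}
\left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^{2}
+ \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad \lambda \ge 0
```

Setting lambda to zero recovers ordinary least squares, and larger values of lambda drive more of the estimated coefficients exactly to zero, which is the source of the sparsity.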
Chapters 15 and 16 discuss methods that move away from specified relationships between the response and the predictors to nonparametric and semiparametric methods, in which the data are used to choose the form of the underlying relationship. In Chapter 15 linear or (explicitly specified) nonlinear relationships are replaced with the notion of relationships taking the form of smooth curves and surfaces. Estimation at a particular location is based on local information; that is, the values of the response in a local neighborhood of that location. This can be done through local versions of weighted least squares (local polynomial estimation) or local regularization (smoothing splines). Such methods can also be used to help identify interactions between numerical predictors in linear regression modeling. Single predictor smoothing estimators can be generalized to multiple predictors through the use of additive functions of smooth curves. Chapter 16 focuses on an extremely flexible class of nonparametric regression estimators, tree‐based methods. Trees are based on the notion of binary recursive partitioning. At each step a set of observations (a node) is either split into two parts (child nodes) on the basis of the values of a chosen variable, or is not split at all, with the choice made so as to encourage homogeneity in the child nodes. This approach provides nonparametric alternatives to linear regression (regression trees), logistic and multinomial regression (classification trees), accelerated failure time and proportional hazards regression (survival trees) and mixed effects regression (longitudinal trees).
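A single step of binary recursive partitioning for a regression tree can be sketched in a few lines. The sketch is written in Python purely for illustration (the book's own analyses are in R). The function name `best_split`, and the use of within-child sum of squared errors as the homogeneity measure, are assumptions made here and not the book's code. It handles one numerical predictor and one split; a full tree would apply the same search recursively to each child node and across all predictors.

```python
def best_split(x, y):
    """Find the single binary split of predictor x that minimizes the total
    within-child sum of squared errors (SSE) of response y.

    Returns (threshold, total_sse), or None if x has no two distinct values.
    Hypothetical helper for illustration only.
    """
    pairs = sorted(zip(x, y))  # order observations by predictor value

    def sse(values):
        # Sum of squared deviations from the mean: smaller = more homogeneous.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a split between equal predictor values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        # Use the midpoint between adjacent predictor values as the threshold.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


if __name__ == "__main__":
    x = [1, 2, 3, 10, 11, 12]
    y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    print(best_split(x, y))
```

With the toy data in the usage lines, the two clusters of response values are separated by a split placed midway between the predictor values 3 and 10.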
A final small change from the first edition to the second edition is in the title, as it now includes the phrase With Applications in R. This is not really a change, of course, as all of the analyses in the first edition were performed using the statistics package R. Code for the output and figures in the book can (still) be found at its associated web site at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/. As was the case in the first edition, even though analyses are performed in R, we still refer to general issues relevant to a data analyst in the use of statistical software even if those issues don't specifically apply to R.
We would like to once again thank our students and colleagues for their encouragement and support, and in particular students for the tough questions that have definitely affected our views on statistical modeling and by extension this book. We would like to thank Jon Gurstelle, and later Kathleen Santoloci and Mindy Okura‐Marszycki, for approaching us with encouragement to undertake a second edition. We would like to thank Sarah Keegan for her patient support in bringing the book to fruition in her role as Project Editor. We would like to thank Roni Chambers for computing assistance, and Glenn Heller and Marc Scott for looking at earlier drafts of chapters. Finally, we would like to thank our families for their continuing love and support.
SAMPRIT CHATTERJEE
Brooksville, Maine
JEFFREY S. SIMONOFF
New York, New York
October, 2019