Читать книгу Applied Biostatistics for the Health Sciences - Richard J. Rossi - Страница 51
2.1.3 Multivariate Data
ОглавлениеIn most research problems, there will be many variables that need to be measured. When the collection of variables measured on each unit consists of two or more variables, a data set is called a multivariate data set, and a multivariate data set consisting of only two variables is called a bivariate data set. In a multivariate data set, there is usually one variable that is of primary interest to a research question that is believed to be explained by some of the other variables measured in the study. The variable of primary interest is called a response variable and the variables believed to cause changes in the response are called explanatory variables or predictor variables. The explanatory variables are often referred to as the input variables and the response variable is often referred to as the output variable. Furthermore, in a statistical model, the response variable is the variable that is being modeled; the explanatory variables are the input variables in the model that are believed to cause or explain differences in the response variable. For example, in studying the survival of melanoma patients, the response variable might be Survival Time that is expected to be influenced by the explanatory variables Age, Gender, Clark’s Stage, and Tumor Size. In this case, a model relating Survival Time to the explanatory variables Age, Gender, Clark’s Stage, and Tumor Size might be investigated in the research study.
A multivariate data set often consists of a mixture of qualitative and quantitative variables. For example, in a biomedical study, several variables that are commonly measured are a subject’s age, race, gender, height, and weight. When data have been collected, the multivariate data set is generally stored in a spreadsheet with the columns containing the data on each variable and the rows of the spreadsheet containing the observations on each subject in the study.
In studying the response variable, it is often the case that there are subpopulations that are determined by a particular set of values of the explanatory variables that will be important in answering the research questions. In this case, it is critical that a variable be included in the data set that identifies which subpopulation each unit belongs to. For example, in the National Health and Nutrition Examination Survey (NHANES) study, the distribution of the weight of female children was studied. The response variable in this study was weight and some of the explanatory variables measured in this study were height, age, and gender. The result of this part of the NHANES study was a distribution of the weights of females over a certain range of age. The resulting distributions were summarized in the chart given in Figure 2.2 that shows the weight ranges for females for several different ages.
Figure 2.2 Weight-by-age chart for girls in the NHANES study.
Example 2.8
In the article “The validity of self-reported weight in US adults: a population based cross-sectional study” published in BMC Public Health (Villanueva, 2001), the author reported the results of a study on the validity of self-reported weight. The data set used in the study was a multivariate data set with response variable being the difference between the self-reported weight and the actual weight of an individual. The explanatory variables in this study were gender, age, race–ethnicity, highest educational attainment, level of activity, and perception of the individuals’ current weight.