Applied Biostatistics for the Health Sciences - Richard J. Rossi

2.2.7 Parameters for Bivariate Populations


In most biomedical research studies, many variables are recorded on each individual in the study. A multivariate distribution can be formed by jointly tabulating, charting, or graphing the values of the variables over the N units in the population. For example, the bivariate distribution of two variables, say X and Y, is the collection of the ordered pairs

$$(X_1, Y_1),\ (X_2, Y_2),\ \ldots,\ (X_N, Y_N)$$

These N ordered pairs form the units of the bivariate distribution of X and Y and their joint distribution can be displayed in a two-way chart, table, or graph.

When the two variables are qualitative, the joint proportions in the bivariate distribution are often denoted by pab, where

$$p_{ab} = \text{the proportion of units in the population with } X = a \text{ and } Y = b$$

The joint proportions in the bivariate distribution are then displayed in a two-way table or two-way bar chart. For example, according to the American Red Cross, the joint distribution of blood type and Rh factor is given in Table 2.7 and presented as a bar chart in Figure 2.21.


Figure 2.21 The joint distribution of blood type and Rh factor according to the American Red Cross.

Table 2.7 The Distribution of Blood Type by Rh Factor According to the American Red Cross

Blood Type    Rh Factor
              +      −
O             38%    7%
A             34%    6%
B             9%     2%
AB            3%     1%
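
As a rough illustration of how a two-way table of joint proportions can be used, the sketch below stores the Table 2.7 values as fractions and sums them to recover the distribution of each variable on its own. The names joint and marginal are hypothetical helpers introduced here, not notation from the text.

```python
# Joint proportions from Table 2.7 (blood type by Rh factor), as fractions.
joint = {
    ("O", "+"): 0.38, ("O", "-"): 0.07,
    ("A", "+"): 0.34, ("A", "-"): 0.06,
    ("B", "+"): 0.09, ("B", "-"): 0.02,
    ("AB", "+"): 0.03, ("AB", "-"): 0.01,
}


def marginal(joint_props, axis):
    """Sum joint proportions over the other variable.

    axis=0 gives the blood type distribution, axis=1 the Rh factor distribution.
    """
    totals = {}
    for key, proportion in joint_props.items():
        totals[key[axis]] = totals.get(key[axis], 0.0) + proportion
    return totals


blood_type = marginal(joint, 0)
rh_factor = marginal(joint, 1)
total = sum(joint.values())

for label, proportion in blood_type.items():
    print(f"Blood type {label}: {proportion:.2f}")
for label, proportion in rh_factor.items():
    print(f"Rh {label}: {proportion:.2f}")
```

The joint proportions should sum to 1 over the whole table, which is a quick check that no cell has been dropped.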

In a bivariate distribution where one of the variables is quantitative and the other is qualitative, the best way to graphically present the distribution is to separate the distribution into subpopulations according to the values of the qualitative variable. For example, if W = the weight of an individual and G = the sex of an individual, then the best way to present the bivariate distribution of weight and sex is to present the two subpopulations separately as shown in Figure 2.22.


Figure 2.22 The distribution of weight for the subpopulations of men and women.

In a multivariate population, the subpopulations remain important, and the individual subpopulation proportions, percentiles, means, medians, modes, standard deviations, variances, and interquartile ranges are parameters that can still be used to summarize each of the subpopulations.
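
A minimal sketch of summarizing a quantitative variable separately within each level of a qualitative variable is given below. The weights are hypothetical values invented for illustration only, and summarize_by_group is a hypothetical helper; the population standard deviation (pstdev) is used because the pairs are treated as the whole population.

```python
from statistics import fmean, median, pstdev

# Hypothetical (weight in pounds, sex) pairs, invented for illustration only.
population = [
    (182, "M"), (175, "M"), (201, "M"), (168, "M"),
    (141, "F"), (152, "F"), (133, "F"), (160, "F"),
]


def summarize_by_group(pairs):
    """Split the values by group label and summarize each subpopulation."""
    groups = {}
    for value, label in pairs:
        groups.setdefault(label, []).append(value)
    return {
        label: {
            "mean": fmean(values),
            "median": median(values),
            "sd": pstdev(values),  # population standard deviation
        }
        for label, values in groups.items()
    }


summary = summarize_by_group(population)
for label, stats in summary.items():
    print(label, {name: round(value, 2) for name, value in stats.items()})
```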

In a bivariate distribution where both of the variables are quantitative, a three-dimensional graph can be used to represent the joint distribution of the variables. The joint distribution is displayed as a three-dimensional probability density graph with one axis for each of the variables and the third axis representing the joint density at each pair (X,Y); however, three-dimensional density plots are sometimes difficult to interpret. An example of a three-dimensional density plot is given in Figure 2.23.


Figure 2.23 A density plot for a bivariate distribution.

To summarize the bivariate distribution of two quantitative variables, proportions, percentiles, mean, median, mode, standard deviation, variance, and interquartile range can be computed for each variable. In a bivariate distribution, the parameters associated with each separate variable are distinguished from each other by the use of subscripts. For example, if the two variables are labeled X and Y, then the mean, median, mode, standard deviation, variance, and interquartile range of the population associated with the variable X will be denoted by

$$\mu_x,\ \tilde{\mu}_x,\ M_x,\ \sigma_x,\ \sigma_x^2,\ \text{and } \mathrm{IQR}_x$$

and similarly for Y

$$\mu_y,\ \tilde{\mu}_y,\ M_y,\ \sigma_y,\ \sigma_y^2,\ \text{and } \mathrm{IQR}_y$$

A parameter that measures the joint relationship between two quantitative variables, say X and Y, is the correlation coefficient, denoted by ρ. The correlation coefficient measures the linear relationship between X and Y. That is, ρ measures the overall agreement between the pairs (X, Y) and the line Y=aX+b. The correlation coefficient is defined as

$$\rho = \frac{\frac{1}{N}\sum_{i=1}^{N} (X_i - \mu_x)(Y_i - \mu_y)}{\sigma_x \sigma_y}$$

where μx, μy and σx, σy are the means and standard deviations of X and Y, respectively.
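
A minimal sketch of computing ρ for a small population follows, assuming the standard definition: the average product of deviations from the means, divided by the product of the population standard deviations. The function name population_correlation and the data values are hypothetical and serve only as an illustration.

```python
from math import sqrt


def population_correlation(xs, ys):
    """Correlation of paired values, treating them as the whole population."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Population standard deviations (divide by N, not N - 1).
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    # Average product of deviations from the means.
    covariance = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return covariance / (sigma_x * sigma_y)


# Hypothetical population of N = 5 ordered pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
rho = population_correlation(xs, ys)
print(round(rho, 4))

# Pairs lying exactly on a line with positive slope, here Y = 2X + 3.
rho_line = population_correlation([1, 2, 3], [5, 7, 9])
print(round(rho_line, 4))
```

For these values the covariance is 1.2 and the variances are 2 and 1.2, so ρ equals the square root of 0.6, roughly 0.77.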

The correlation coefficient is a unitless parameter that falls between −1 and 1. That is, the correlation between X and Y does not depend on the particular units or scales in which the variables are measured. For example, if X is the weight of an individual in pounds and Y is the height of an individual in inches, then the value of the correlation coefficient will be the same when X is measured in kilograms and Y is measured in centimeters. Also, the correlation between X and Y is the same as the correlation between Y and X (i.e., Corr(X,Y)=Corr(Y,X)).
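
The unit invariance and symmetry described above can be checked numerically. The sketch below uses hypothetical weights and heights invented for illustration, converts them from pounds and inches to kilograms and centimeters, and compares the resulting correlations; population_correlation is a hypothetical helper.

```python
from statistics import fmean, pstdev


def population_correlation(xs, ys):
    """Mean product of deviations divided by the product of population SDs."""
    mu_x, mu_y = fmean(xs), fmean(ys)
    covariance = fmean([(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)])
    return covariance / (pstdev(xs) * pstdev(ys))


# Hypothetical weights (pounds) and heights (inches), for illustration only.
weights_lb = [120, 150, 165, 180, 210]
heights_in = [62, 66, 68, 70, 74]

# The same individuals measured in kilograms and centimeters.
weights_kg = [w * 0.45359237 for w in weights_lb]
heights_cm = [h * 2.54 for h in heights_in]

rho_imperial = population_correlation(weights_lb, heights_in)
rho_metric = population_correlation(weights_kg, heights_cm)
rho_swapped = population_correlation(heights_in, weights_lb)

print(round(rho_imperial, 6), round(rho_metric, 6), round(rho_swapped, 6))
```

Rescaling multiplies the covariance and the product of standard deviations by the same positive factor, so the ratio is unchanged up to floating-point rounding.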

It is important to note that the correlation coefficient can be used only with two quantitative variables and only measures the strength of the linear relationship between X and Y. A positive value of the correlation coefficient suggests that the larger values of X are more likely to occur with the larger values of Y and the smaller values of X with the smaller values of Y. A negative correlation indicates the larger values of X are more likely to occur with the smaller values of Y and the smaller values of X with the larger values of Y. Several properties of the correlation coefficient are listed below.

1 The value of the correlation coefficient is always between −1 and 1 (i.e., −1≤ρ≤1). When the correlation coefficient equals −1 or 1, the variables X and Y are said to be perfectly correlated. When two variables X and Y are perfectly correlated, the linear relationship is exact and Y=aX+b for some values a and b. In this case, the value of Y is determined by the value of X. Furthermore, when ρ=−1 the slope a will be negative, and when ρ=1 the slope a will be positive.

2 When ρ≠±1, the value of Y cannot be perfectly predicted from the value of X and the relationship is Y=aX+b+ε, where ε is an error term associated with the deviation from the linear relationship. The closer ρ is to ±1, the stronger the linear relationship between the variables X and Y.

3 The strength of the linear relationship between X and Y is based on the value of the correlation coefficient and is often summarized according to the following guidelines:
   −0.30<ρ<0.30 indicates at most a very weak linear relationship,
   −0.50<ρ≤−0.30 or 0.30≤ρ<0.50 indicates a weak linear relationship,
   −0.80<ρ≤−0.50 or 0.50≤ρ<0.80 indicates a moderate linear relationship,
   −0.90<ρ≤−0.80 or 0.80≤ρ<0.90 indicates a strong linear relationship,
   ρ≤−0.90 or ρ≥0.90 indicates a very strong linear relationship.
   However, any discussion of the strength of the linear relationship between two variables must take into account the standards used in the discipline in which the research is being carried out.

4 When ρ ≈ 0, there is no apparent linear relationship between the two variables. However, this does not exclude the possibility that there is a curvilinear relationship between the two variables.
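
The guidelines in property 3 and the caution in property 4 can be sketched in code. The function linear_strength is a hypothetical helper that maps a value of ρ to the verbal labels listed above, treating ρ=±0.30 as the start of the weak band. The second part builds a small hypothetical population in which Y=X², an exact curvilinear relationship whose correlation coefficient is nevertheless 0.

```python
from math import sqrt


def linear_strength(rho):
    """Map a correlation coefficient to the verbal guideline labels."""
    if abs(rho) > 1 + 1e-9:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(rho)  # the guidelines are symmetric about 0
    if size < 0.30:
        return "at most very weak"
    if size < 0.50:
        return "weak"
    if size < 0.80:
        return "moderate"
    if size < 0.90:
        return "strong"
    return "very strong"


def population_correlation(xs, ys):
    """Average product of deviations over the product of population SDs."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    covariance = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return covariance / (sigma_x * sigma_y)


# Hypothetical population where Y is determined exactly by X, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
rho_curved = population_correlation(xs, ys)

print(rho_curved, linear_strength(rho_curved))
print(linear_strength(-0.85), linear_strength(0.42))
```

Because the X values are symmetric about 0, the positive and negative products of deviations cancel, so ρ is 0 even though Y can be predicted perfectly from X.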

In a multivariate distribution with more than two quantitative variables, the correlation coefficient can be computed for each pair of variables. For example, with three quantitative variables, say X, Y, and Z, the three correlation coefficients that can be computed are Corr(X,Y)=ρxy, Corr(X,Z)=ρxz, and Corr(Y,Z)=ρyz. In most biomedical studies, there is a well-defined response variable and a set of explanatory variables. Since the changes in the explanatory variables are believed to cause changes in the response variable, the most important correlations to consider are those between the response variable and each of the explanatory variables.
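
With several quantitative variables, the pairwise correlations can be computed in one pass over all pairs. The sketch below does this for three variables; the data values, the dictionary variables, and the helper population_correlation are hypothetical and for illustration only.

```python
from itertools import combinations
from math import sqrt


def population_correlation(xs, ys):
    """Average product of deviations over the product of population SDs."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    covariance = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    return covariance / (sigma_x * sigma_y)


# Hypothetical values of three quantitative variables on the same N = 5 units.
variables = {
    "X": [1, 2, 3, 4, 5],
    "Y": [2, 4, 5, 4, 5],
    "Z": [5, 3, 4, 2, 1],
}

# One correlation per unordered pair: (X, Y), (X, Z), and (Y, Z).
pairwise = {
    (a, b): population_correlation(variables[a], variables[b])
    for a, b in combinations(variables, 2)
}

for (a, b), rho in pairwise.items():
    print(f"Corr({a},{b}) = {rho:.3f}")
```

With k variables there are k(k−1)/2 such pairs, so three variables give three correlations, matching ρxy, ρxz, and ρyz.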

Finally, correlation should not be confused with causation. A causal relationship exists when changing the value of X directly causes a change in the value of Y or vice versa. The correlation coefficient only measures the tendency for the value of Y to increase or decrease linearly with the values of X. Thus, a high correlation between X and Y does not necessarily indicate that changes in X will cause changes in Y. For example, there is a positive correlation between the number of times an individual on a diet weighs themselves in a week and their weight loss. Clearly, the number of times an individual weighs themselves does not cause a change in their weight. Causal relationships must be supported by honest logical and scientific reasoning. With the proper use of scientific principles and well-designed experiments, high correlations can often be used as evidence supporting a causal relationship.
