Statistics and Probability with Applications for Engineers and Scientists Using MINITAB, R and JMP - Bhisham C. Gupta, Irwin Guttman

2.9 Measures of Association


So far in this chapter, the discussion has focused on univariate statistics, because we were interested in a single characteristic of a subject. In every example considered, the variable of interest was either qualitative or quantitative. We now study cases involving two variables, that is, two characteristics of a subject. The two variables could be either qualitative or quantitative, but here we consider only quantitative variables.

When two variables are considered simultaneously, the data obtained are known as bivariate data. In examining bivariate data, the first question is whether there is any association of interest between the two variables. One effective way to answer it is to prepare a graph by plotting one variable along the horizontal scale (x-axis) and the other along the vertical scale (y-axis). Each pair of observations is then plotted as a point in the xy-plane. The resulting graph is called a scatter plot. A scatter plot is a very useful graphical tool because it reveals the nature and strength of the association between two variables. The following example makes the concept of association clear.

Example 2.9.1 (Cholesterol level and systolic blood pressure) The cholesterol level and the systolic blood pressure of 10 randomly selected US males in the age group 40–50 years are given in Table 2.9.1. Construct a scatter plot of these data and determine whether there is any association between the cholesterol levels and the systolic blood pressures.

Solution: Figure 2.9.1 shows the scatter plot of the data in Table 2.9.1. This scatter plot clearly indicates a fairly strong upward linear trend. Also, if a straight line is drawn through the data points, the points are concentrated around that line within a narrow band. The upward trend indicates a positive association between the two variables, while the width of the band indicates the strength of the association, which in this case is quite strong. As the association between the two variables gets stronger, the band enclosing the plotted points becomes narrower. A downward trend indicates a negative association between the two variables.

A numerical measure of association between two numerical variables is the Pearson correlation coefficient, named after the English statistician Karl Pearson (1857–1936). Note that a correlation coefficient does not measure causation. In other words, correlation and causation are different concepts. Causation produces correlation, but correlation does not necessarily imply causation. The correlation coefficient between two numerical variables in a set of sample data is usually denoted by r, and the correlation coefficient for population data is denoted by the Greek letter $\rho$ (rho). The correlation coefficient r based on n pairs of observations on $(X, Y)$, say $(x_i, y_i)$, $i = 1, 2, \ldots, n$, is defined as

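$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{2.9.1}$$

or

$$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}} \tag{2.9.2}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the $n$ observed values $x_i$ and $y_i$, respectively.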
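The two expressions for r are algebraically equivalent. The short derivation below is a sketch that assumes (2.9.1) is the deviation (mean-centered) form and (2.9.2) is the computational form based on raw sums, which is the standard pairing. It uses only $\bar{x} = \sum x_i / n$ and $\bar{y} = \sum y_i / n$, with all sums running over $i = 1, \ldots, n$.

```latex
\begin{align*}
\sum (x_i-\bar{x})(y_i-\bar{y})
  &= \sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\,\bar{x}\,\bar{y} \\
  &= \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}, \\[4pt]
\sum (x_i-\bar{x})^2
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
  \qquad \text{(the same identity with } y_i = x_i\text{)}, \\[4pt]
\sum (y_i-\bar{y})^2
  &= \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}.
\end{align*}
```

Substituting these three identities into the numerator and denominator of the deviation form gives the computational form.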

Table 2.9.1 Cholesterol levels and systolic BP of 10 randomly selected US males.

Subject 1 2 3 4 5 6 7 8 9 10
Cholesterol (x) 195 180 220 160 200 220 200 183 139 155
Systolic BP (y) 130 128 138 122 140 148 142 127 116 123


Figure 2.9.1 MINITAB printout of scatter plot for the data in Table 2.9.1.
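Since the plot itself is not reproduced here, the following is a minimal, dependency-free sketch that draws a rough character-grid scatter plot of the Table 2.9.1 data. It is not the MINITAB output; the helper name `text_scatter` and the grid size are illustrative assumptions.

```python
# Data from Table 2.9.1
cholesterol = [195, 180, 220, 160, 200, 220, 200, 183, 139, 155]
systolic_bp = [130, 128, 138, 122, 140, 148, 142, 127, 116, 123]


def text_scatter(xs, ys, width=40, height=12):
    """Return a character-grid scatter plot as a list of strings (top row first).

    Hypothetical helper for illustration; assumes xs and ys each contain
    at least two distinct values so the scaling below never divides by zero.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(r) for r in grid]


plot_rows = text_scatter(cholesterol, systolic_bp)
for line in plot_rows:
    print("|" + line)
print("+" + "-" * 40)
print(" cholesterol (x) increases to the right; systolic BP (y) increases upward")
```

The points run from the lower left to the upper right of the grid, which is the upward trend described in the solution to Example 2.9.1.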

The correlation coefficient is a dimensionless measure that can attain any value in the interval $[-1, 1]$. As the strength of the association between the two variables grows, the absolute value of r approaches 1. Thus, when there is a perfect linear association between the two variables, $r = 1$ or $r = -1$, depending on whether the association is positive or negative. In other words, $r = 1$ if the two variables move in the same direction, and $r = -1$ if they move in opposite directions.
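These properties can be checked numerically. The sketch below uses small made-up data sets (not from the text) and a hypothetical helper `pearson_r` that implements the usual mean-centered form of the sample correlation coefficient. It illustrates that an exact increasing or decreasing linear relation gives a coefficient of 1 or -1, and that changing the units of either variable leaves the coefficient unchanged.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient in deviation (mean-centered) form."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]        # illustrative values, not from the text
y_up = [2 * v + 3 for v in x]        # exact increasing linear relation
y_down = [10 - 0.5 * v for v in x]   # exact decreasing linear relation

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# Rescaling x by a positive factor and shifting y (a change of units)
# leaves r unchanged, which is what "dimensionless" means in practice.
y_noisy = [2.1, 3.9, 6.2, 7.8, 10.1]  # illustrative values, roughly linear
r_original = pearson_r(x, y_noisy)
r_rescaled = pearson_r([100 * v for v in x], [v + 50 for v in y_noisy])

print(r_up, r_down, r_original, r_rescaled)
```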

Perfect association means that if we know the value of one variable, then the value of the other variable can be determined without any error. The other special case is $r = 0$, which does not mean that there is no association between the two variables, but rather that there is no linear association between them. As a general rule, the linear association is weak, moderate, or strong when the absolute value of r is less than 0.3, between 0.3 and 0.7, or greater than 0.7, respectively. For instance, if (2.9.1) is computed for the data in Table 2.9.1, then $r \approx 0.924$. Hence, we can conclude that the association between the two variables X and Y is strong.
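The correlation for the data in Table 2.9.1 can be verified with a short calculation. The sketch below computes the coefficient in both the mean-centered form and the raw-sums form, then applies the weak/moderate/strong rule of thumb. The function names are hypothetical, and assigning the boundary values 0.3 and 0.7 to "moderate" is an assumption, since the rule does not say where they belong. The last lines illustrate the zero-correlation case with made-up data in which y is completely determined by x, yet the relation is not linear.

```python
import math

# Data from Table 2.9.1
cholesterol = [195, 180, 220, 160, 200, 220, 200, 183, 139, 155]
systolic_bp = [130, 128, 138, 122, 140, 148, 142, 127, 116, 123]


def pearson_r_deviation(xs, ys):
    """Correlation from sums of products of deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def pearson_r_computational(xs, ys):
    """Correlation from raw sums (no explicit centering)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


def strength(r):
    """Rule-of-thumb label; boundary handling at 0.3 and 0.7 is an assumption."""
    a = abs(r)
    if a < 0.3:
        return "weak"
    if a <= 0.7:
        return "moderate"
    return "strong"


r_dev = pearson_r_deviation(cholesterol, systolic_bp)
r_comp = pearson_r_computational(cholesterol, systolic_bp)
print(f"r = {r_dev:.3f} ({strength(r_dev)})")  # prints: r = 0.924 (strong)

# Zero correlation does not mean no association: y = x^2 on symmetric x values
# is fully determined by x, but the relation is not linear.
xs = [-2, -1, 0, 1, 2]          # illustrative values, not from the text
ys = [v ** 2 for v in xs]
r_parabola = pearson_r_deviation(xs, ys)
print(f"r for y = x^2: {r_parabola:.3f}")
```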

