Industrial Data Analytics for Diagnosis and Prognosis - Yong Chen - Page 31

Sample Covariance and Correlation – Measure of Linear Association Between Two Variables

If each of the n observations of a data set is measured on two variables x₁ and x₂, let (x₁₁, x₂₁,...,x_n₁) and (x₁₂, x₂₂,...,x_n₂) denote the n observations on x₁ and x₂, respectively. The sample covariance of x₁ and x₂ is defined as

$$s_{12} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2) \tag{2.2}$$

where x̄₁ and x̄₂ are the sample means of x₁ and x₂, respectively. The value of the sample covariance of two variables is affected by the linear association between them. From (2.2), if x₁ and x₂ have a strong positive linear association, they are usually both above their means or both below their means. Consequently, the product (x_i1 − x̄₁)(x_i2 − x̄₂) will typically be positive and the sample covariance will have a large positive value. On the other hand, if x₁ and x₂ have a strong negative linear association, the product (x_i1 − x̄₁)(x_i2 − x̄₂) will typically be negative and the sample covariance will have a negative value. If y₁ and y₂ are obtained by multiplying each measurement of x₁ and x₂ by a₁ and a₂, respectively, it is easy to see from (2.2) that the sample covariance of y₁ and y₂ is

$$s_{y_1 y_2} = a_1 a_2 \, s_{12} \tag{2.3}$$
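The step from (2.2) to the scaling result can be spelled out as a short derivation. The sketch below writes y_ij = a_j x_ij for the scaled measurements and uses s_{y_1 y_2} for their sample covariance; that symbol is notation assumed here for illustration.

```latex
\begin{aligned}
\bar{y}_j &= \frac{1}{n}\sum_{i=1}^{n} a_j x_{ij} = a_j \bar{x}_j, \qquad j = 1, 2, \\
s_{y_1 y_2} &= \frac{1}{n-1}\sum_{i=1}^{n} (a_1 x_{i1} - a_1 \bar{x}_1)(a_2 x_{i2} - a_2 \bar{x}_2) \\
            &= a_1 a_2 \cdot \frac{1}{n-1}\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)
             = a_1 a_2 \, s_{12}.
\end{aligned}
```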

Equation (2.3) says that if the measurements are scaled, for example by changing measurement units, the sample covariance is scaled correspondingly. Because the sample covariance depends on the measurement units, it is difficult to judge how large a value must be to indicate a strong (linear) association between two variables. The sample correlation, defined as follows, is a measure of linear association that does not depend on the measurement units or scaling of the variables:

$$r_{12} = \frac{s_{12}}{s_1 s_2} \tag{2.4}$$

where s₁ and s₂ are the sample standard deviations of x₁ and x₂, respectively. The sample correlation ranges between −1 and 1, with values close to 1, −1, and 0 indicating a strong positive linear association, a strong negative linear association, and no linear association, respectively.
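The contrast between (2.3) and (2.4) can be checked numerically. The following is a minimal sketch in R, using the curb.weight and length columns of the 10-observation sample in Table 2.1; the scale factors a1 and a2 are arbitrary illustrative values (hypothetical unit changes, not taken from the text). Both factors are positive here; a negative factor would flip the sign of the correlation.

```r
# Curb weight (x1) and length (x2) from the Table 2.1 sample
x1 <- c(3515, 2300, 2800, 2122, 2293, 2765, 2275, 1890, 2926, 1909)
x2 <- c(190.9, 168.7, 168.9, 166.3, 169.1, 176.8, 171.7, 159.1, 173.2, 158.8)

# Arbitrary positive scale factors (assumed for illustration only)
a1 <- 0.001
a2 <- 2.54
y1 <- a1 * x1
y2 <- a2 * x2

cov_x <- cov(x1, x2)
cov_y <- cov(y1, y2)
cor_x <- cor(x1, x2)
cor_y <- cor(y1, y2)

# (2.3): the covariance is multiplied by a1 * a2
cov_scales <- isTRUE(all.equal(cov_y, a1 * a2 * cov_x))

# The correlation is unchanged by the rescaling
cor_unchanged <- isTRUE(all.equal(cor_y, cor_x))

# (2.4): correlation as covariance divided by the two standard deviations
cor_from_def <- cov_x / (sd(x1) * sd(x2))

print(c(cov_scales = cov_scales, cor_unchanged = cor_unchanged))
```

The covariance changes by the factor a1 * a2, while the correlation computed from the rescaled values agrees with the original one.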

Example 2.2 To illustrate the calculation of summary statistics, we take a random sample of 10 observations, as shown in Table 2.1, from the auto.spec data set on the variables curb.weight, length, and width. We use x_i, i = 1, 2, 3, to represent the three variables, in that order:

Table 2.1 A random sample of 10 observations from the auto.spec data set.

x₁	x₂	x₃
3515	190.9	70.3
2300	168.7	64.0
2800	168.9	65.0
2122	166.3	64.4
2293	169.1	66.0
2765	176.8	64.8
2275	171.7	65.5
1890	159.1	64.2
2926	173.2	66.3
1909	158.8	63.6

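To obtain the sample covariance of the variables curb.weight and length for the data in Table 2.1, we first calculate the sample means x̄₁ and x̄₂ as:

$$\bar{x}_1 = \frac{3515 + 2300 + \cdots + 1909}{10} = 2479.5, \qquad \bar{x}_2 = \frac{190.9 + 168.7 + \cdots + 158.8}{10} = 170.35$$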

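By (2.2), the sample covariance of the two variables can be obtained as

$$s_{12} = \frac{1}{9}\left[(3515 - 2479.5)(190.9 - 170.35) + \cdots + (1909 - 2479.5)(158.8 - 170.35)\right] = 4316.8$$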

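The s₁₂ value of 4316.8 by itself cannot tell us whether the two variables have a strong or weak (linear) relationship. Such information can be provided by the correlation. To evaluate the sample correlation, we first need the sample variances of x₁ and x₂. By (2.1), we have

$$s_1^2 = \frac{1}{9}\sum_{i=1}^{10}(x_{i1} - 2479.5)^2 \approx 262829.2, \qquad s_2^2 = \frac{1}{9}\sum_{i=1}^{10}(x_{i2} - 170.35)^2 \approx 84.07$$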

By (2.4), we have

$$r_{12} = \frac{4316.8}{\sqrt{262829.2}\,\sqrt{84.07}} \approx 0.918,$$

which is close to 1 and corresponds to a strong positive linear association between the curb weight and length of cars.
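The hand calculation in this example can be retraced step by step in R. The sketch below applies (2.2) and (2.4) directly to the Table 2.1 values and then cross-checks against the built-in cov() and cor(). The variable names are chosen here for illustration, and the n − 1 divisor for the variances is assumed to match (2.1).

```r
# Table 2.1 sample: x1 = curb.weight, x2 = length
x1 <- c(3515, 2300, 2800, 2122, 2293, 2765, 2275, 1890, 2926, 1909)
x2 <- c(190.9, 168.7, 168.9, 166.3, 169.1, 176.8, 171.7, 159.1, 173.2, 158.8)
n <- length(x1)

# Sample means
xbar1 <- sum(x1) / n
xbar2 <- sum(x2) / n

# Sample covariance by (2.2)
s12 <- sum((x1 - xbar1) * (x2 - xbar2)) / (n - 1)

# Sample variances (n - 1 divisor, assumed to match (2.1))
s1_sq <- sum((x1 - xbar1)^2) / (n - 1)
s2_sq <- sum((x2 - xbar2)^2) / (n - 1)

# Sample correlation by (2.4)
r12 <- s12 / (sqrt(s1_sq) * sqrt(s2_sq))

print(round(c(xbar1 = xbar1, xbar2 = xbar2, s12 = s12, r12 = r12), 4))

# Cross-check against R's built-in functions
stopifnot(isTRUE(all.equal(s12, cov(x1, x2))),
          isTRUE(all.equal(r12, cor(x1, x2))))
```

The manual results agree with cov() and cor(), which also use the n − 1 divisor.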

Example 2.3 In R, the sample mean, variance, covariance, and correlation can be found using the functions mean(), var(), cov(), and cor(), respectively. For example, the following R code can be used to find the sample mean and sample variance of curb.weight, and the sample covariance and correlation between curb.weight and length, in the auto.spec data set.

> mean(auto.spec.df$curb.weight)
[1] 2555.566
> var(auto.spec.df$curb.weight)
[1] 271107.9
> with(auto.spec.df, cov(curb.weight, length))
[1] 5638.336
> with(auto.spec.df, cor(curb.weight, length))
[1] 0.8777285

Note that the results above differ from those in Example 2.2 because this example uses the entire auto.spec data set rather than the small random subset used in Example 2.2.
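Since the full auto.spec data set is not reproduced here, the same function calls can be tried on the 10-observation sample instead. The sketch below builds a small data frame from Table 2.1; the name sample.df is hypothetical and chosen only for this illustration. Its results correspond to Example 2.2, not to the full-data output above.

```r
# Data frame built from Table 2.1 (sample.df is a hypothetical name)
sample.df <- data.frame(
  curb.weight = c(3515, 2300, 2800, 2122, 2293, 2765, 2275, 1890, 2926, 1909),
  length      = c(190.9, 168.7, 168.9, 166.3, 169.1, 176.8, 171.7, 159.1, 173.2, 158.8),
  width       = c(70.3, 64.0, 65.0, 64.4, 66.0, 64.8, 65.5, 64.2, 66.3, 63.6)
)

# Same calls as in Example 2.3, applied to the small sample
m_weight <- mean(sample.df$curb.weight)
v_weight <- var(sample.df$curb.weight)
c_wl <- with(sample.df, cov(curb.weight, length))
r_wl <- with(sample.df, cor(curb.weight, length))

print(c(mean = m_weight, var = v_weight, cov = c_wl, cor = r_wl))
```

Here with() evaluates the expression using the data frame's columns as variables, so the columns can be named without the sample.df$ prefix.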
