Data Science in Theory and Practice - Maria Cristina Mariani
3.5 Correlation Matrices
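A correlation matrix is a table showing the correlation coefficients between variables. Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. The sample correlation between the $j$th and $k$th variables is defined as

$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}\,s_{kk}}} = \frac{s_{jk}}{s_j s_k}, \tag{3.6}$$

where

$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k), \qquad s_j = \sqrt{s_{jj}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2},$$

with $x_{ij}$ the $i$th observation on the $j$th variable and $\bar{x}_j$ the sample mean of the $j$th variable. Substituting $s_{jk}$, $s_j$, and $s_k$ into (3.6) and canceling terms, we obtain

$$r_{jk} = \frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)}{\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n}(x_{ik}-\bar{x}_k)^2}} \tag{3.7}$$

for $j = 1, \ldots, p$ and $k = 1, \ldots, p$. We note that the sample correlation is symmetric since $r_{jk} = r_{kj}$ for all $j$ and $k$.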
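As a minimal sketch of the ratio in (3.7), the following Python function computes the sample correlation between two variables from sums of centered products, so the common divisor of the covariance and the variances never appears. The function name `sample_corr` and the small input arrays are illustrative assumptions and do not come from the text; NumPy is assumed to be available.

```python
import numpy as np

def sample_corr(x_j, x_k):
    """Sample correlation as a ratio of centered sums (the divisor cancels)."""
    x_j = np.asarray(x_j, dtype=float)
    x_k = np.asarray(x_k, dtype=float)
    d_j = x_j - x_j.mean()   # deviations from the mean of variable j
    d_k = x_k - x_k.mean()   # deviations from the mean of variable k
    return np.sum(d_j * d_k) / np.sqrt(np.sum(d_j**2) * np.sum(d_k**2))

# Hypothetical measurements on two variables (illustrative only)
a = [2.0, 4.0, 6.0, 9.0]
b = [1.0, 3.0, 2.0, 5.0]

r_ab = sample_corr(a, b)
r_ba = sample_corr(b, a)   # swapping the arguments should give the same value
```

Swapping the two arguments only reorders the factors in the numerator and denominator, which is why the result is symmetric in the two variables.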
The sample correlation coefficient is a measure of the linear association between two variables and does not depend on the units of measurement, i.e. when you construct the sample correlation coefficient, the units of measurement that are used cancel out. The sample correlation matrix is analogous to the covariance matrix with correlations in place of covariances:
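$$\mathbf{R} = (r_{jk}) = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}. \tag{3.8}$$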
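One way to build the sample correlation matrix in practice is to divide each entry of the sample covariance matrix by the product of the two corresponding standard deviations, applying (3.6) element by element. The sketch below assumes NumPy and uses a made-up data matrix `X` with observations in rows and variables in columns; none of the values come from the text.

```python
import numpy as np

# Hypothetical data matrix: n = 5 observations (rows), p = 3 variables (columns)
X = np.array([
    [1.0, 10.0, 3.0],
    [2.0, 14.0, 1.0],
    [3.0, 15.0, 4.0],
    [4.0, 21.0, 1.0],
    [5.0, 24.0, 5.0],
])

S = np.cov(X, rowvar=False)   # sample covariance matrix (divisor n - 1)
s = np.sqrt(np.diag(S))       # sample standard deviations of each variable
R = S / np.outer(s, s)        # each covariance divided by the product of the two standard deviations
```

The result has ones on the diagonal and should agree with `np.corrcoef(X, rowvar=False)` up to floating-point rounding.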
The population correlation matrix similar to (3.8) is defined as follows:
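$$\mathbf{P}_\rho = (\rho_{jk}) = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}, \tag{3.9}$$

where

$$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k}.$$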
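The element-wise definition can also be written compactly in matrix form. As a sketch, let $\boldsymbol{\Sigma} = (\sigma_{jk})$ denote the population covariance matrix, let $\sigma_j = \sqrt{\sigma_{jj}}$, and let $\mathbf{D}_\sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p)$; the symbol $\mathbf{D}_\sigma$ is notation assumed here for illustration. Dividing row $j$ and column $k$ of $\boldsymbol{\Sigma}$ by $\sigma_j$ and $\sigma_k$ respectively gives

$$\mathbf{P}_\rho = \mathbf{D}_\sigma^{-1}\,\boldsymbol{\Sigma}\,\mathbf{D}_\sigma^{-1}, \qquad \text{and conversely} \qquad \boldsymbol{\Sigma} = \mathbf{D}_\sigma\,\mathbf{P}_\rho\,\mathbf{D}_\sigma .$$

The same relationship holds for the sample quantities, with the sample covariance matrix and the sample standard deviations in place of their population counterparts.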
We note that even though the signs of the sample correlation and the sample covariance are the same, the correlation is easier to interpret because its magnitude is bounded: it lies within the closed interval $[-1, 1]$. To summarize, the sample correlation has the following properties:
1 The value of the sample correlation must lie between $-1$ and $1$ inclusive. A value of $1$ indicates a perfect positive linear relationship and a value of $-1$ indicates a perfect inverse linear relationship.
2 The sample correlation measures the strength of the linear association between two variables. If $r_{jk}$ equals zero, there is no linear association between the components. Otherwise, the sign of $r_{jk}$ indicates the direction of the association. If $r_{jk}$ is positive, then as one variable gets larger the other tends to get larger. If $r_{jk}$ is negative, then as one gets larger the other tends to get smaller (often called an “inverse” correlation). A larger value of $|r_{jk}|$ implies a stronger linear association. In the extreme case $r_{jk} = -1$, the two variables move in exactly opposite directions: if one variable increases, the other decreases by a proportional amount (and vice versa).
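The properties above can be checked numerically. The following sketch, which assumes NumPy and uses made-up values, builds one exactly increasing and one exactly decreasing linear relationship, and then rescales a variable to mimic a change of measurement units.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Exact linear relationships: the correlation reaches the ends of its range
r_pos = np.corrcoef(x, 3.0 * x + 2.0)[0, 1]    # increasing line, close to +1
r_neg = np.corrcoef(x, -0.5 * x + 7.0)[0, 1]   # decreasing line, close to -1

# Hypothetical noisy measurements (illustrative only)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
r_xy = np.corrcoef(x, y)[0, 1]

# Changing the units of x (multiplying by 100) should leave the correlation unchanged
r_scaled = np.corrcoef(100.0 * x, y)[0, 1]
```

Because the rescaling factor appears in both the numerator and the denominator of the correlation, it cancels, so `r_xy` and `r_scaled` should agree up to rounding.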
Example 3.4 Consider the following data matrix introduced in Example 3.1:
Each receipt yields a pair of measurements: total dollar sales and number of movies sold. We find the sample correlation as follows:
Therefore,
In this example, we observe that the two variables, dollar sales and number of movies sold, are highly positively correlated. This implies that as dollar sales increase, the number of movies sold tends to increase as well.
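The steps of such a calculation can be sketched in Python. The receipt values below are hypothetical and chosen only for illustration; they are not the data matrix of Example 3.1, and the variable names are assumptions. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical receipts (NOT the data of Example 3.1):
# column 0 = total dollar sales, column 1 = number of movies sold
X = np.array([
    [42.0, 4.0],
    [52.0, 5.0],
    [48.0, 4.0],
    [58.0, 6.0],
])
n = X.shape[0]

x_bar = X.mean(axis=0)                       # sample mean of each variable
D = X - x_bar                                # deviations from the means

s11 = np.sum(D[:, 0] ** 2) / (n - 1)         # sample variance of dollar sales
s22 = np.sum(D[:, 1] ** 2) / (n - 1)         # sample variance of movies sold
s12 = np.sum(D[:, 0] * D[:, 1]) / (n - 1)    # sample covariance

r12 = s12 / np.sqrt(s11 * s22)               # sample correlation
```

For these made-up values the correlation comes out above 0.9, so the two hypothetical variables would be described as highly positively correlated.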