Industrial Data Analytics for Diagnosis and Prognosis - Yong Chen

Scatter Plot Matrix and Heatmap


The pairwise relationships among multiple numerical variables can be visualized simultaneously with a matrix of scatter plots. The following R code plots the scatter plot matrix for five of the numerical variables in the auto_spec data set: wheel.base, height, curb.weight, city.mpg, and highway.mpg. The column indices of these five variables are 8, 11, 12, 22, and 23, respectively.

var.idx <- c(8, 11, 12, 22, 23)
plot(auto.spec.df[, var.idx])
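Hard-coded column positions break silently if the column order of the data frame changes. As a minimal sketch, the same selection can be made by name. The data frame `auto.demo` below is a hypothetical stand-in with made-up values, not the auto_spec data, and it assumes the column names match the variable names used in the text.

```r
# Hypothetical stand-in for auto.spec.df; the values are made up for illustration.
auto.demo <- data.frame(
  make        = c("a", "b", "c", "d"),
  wheel.base  = c(88.6, 94.5, 99.8, 105.8),
  height      = c(48.8, 52.4, 54.3, 55.7),
  curb.weight = c(2548, 2823, 2337, 3086),
  city.mpg    = c(21, 19, 24, 17),
  highway.mpg = c(27, 26, 30, 20)
)

var.names <- c("wheel.base", "height", "curb.weight", "city.mpg", "highway.mpg")

# Fail early if any expected column is missing.
stopifnot(all(var.names %in% names(auto.demo)))

# Recover the positional indices in case they are needed elsewhere.
var.idx <- match(var.names, names(auto.demo))

# plot() on an all-numeric data frame draws the scatter plot matrix.
plot(auto.demo[, var.names])
```

Selecting by name also makes the code self-documenting, since the reader no longer has to look up which variable sits in column 22.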

The scatter plot matrix in Figure 2.8 reveals several types of relationships among the variables. For example, there is a strong positive linear relationship between city.mpg and highway.mpg. The other three variables, wheel.base, height, and curb.weight, are positively related to each other. In addition, curb.weight is negatively related to both city.mpg and highway.mpg.


Figure 2.8 Scatter plot matrix for five numerical variables.
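Relationships read off a scatter plot matrix can be checked numerically by printing the correlation of each pair inside the matrix itself. The sketch below is one way to do this with base R's pairs(). It uses the built-in mtcars data set as a stand-in (an assumption, since auto.spec.df is not available here), and panel.cor is a helper name introduced only for this illustration.

```r
# Stand-in data: the built-in mtcars data set, not auto.spec.df.
vars <- c("mpg", "disp", "hp", "wt", "qsec")
dat <- mtcars[, vars]

# Illustrative panel function: print the Pearson correlation of each pair.
panel.cor <- function(x, y, ...) {
  r <- cor(x, y)
  # Place the label at the center of the panel's own coordinate range.
  usr <- par("usr")
  text(mean(usr[1:2]), mean(usr[3:4]), sprintf("%.2f", r), cex = 1.2)
}

# Scatter plots below the diagonal, correlation values above it.
pairs(dat, lower.panel = points, upper.panel = panel.cor)

# The same numbers as a table.
cor.mat <- cor(dat)
print(round(cor.mat, 2))
```

In mtcars, weight (wt) and fuel economy (mpg) are strongly negatively correlated, the same kind of pattern the text describes for curb.weight and the two MPG variables.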

When the number of numerical variables is large, it becomes difficult to read all of the pairwise scatter plots in a scatter plot matrix. In this case, we can use a heatmap of the pairwise correlations to show the strength of the relationships at a glance. The heatmap uses different shades of color to represent the correlation values, so that cells or regions of strong positive or negative relationship can be detected quickly. A detailed discussion of correlation is provided in Section 2.2. We draw the heatmap of correlations for all numerical variables in the auto_spec data set using the following R code.

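library(gplots)
# Column indices of the numerical variables
var.idx <- c(8:12, 15, 17:23)
# Drop observations with missing values
data.nomiss <- na.omit(auto.spec.df[, var.idx])
# Correlation heatmap without reordering, dendrograms, color key, or trace lines
heatmap.2(cor(data.nomiss), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
          cellnote = round(cor(data.nomiss), 2), notecol = "black",
          key = FALSE, trace = "none", margins = c(10, 10))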

In the above R code, we use the heatmap.2() function from the gplots package to draw the heatmap. We first remove the observations with missing values using the na.omit() function. The heatmap is then drawn for the pairwise correlations calculated by cor(). In the heatmap of all numerical variables, shown in Figure 2.9, a lighter color indicates a strong positive (linear) relationship between two variables and a darker color indicates a strong negative (linear) relationship. The correlation values are shown within each cell of the heatmap. The diagonal cells have the lightest color because every variable has a correlation of 1 with itself. From the heatmap in Figure 2.9, we can also see that the two MPG variables (city.mpg and highway.mpg) have strong negative relationships with many of the other numerical variables in the data set.


Figure 2.9 Heatmap of correlation for all numerical variables.
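The missing-value step matters because cor() returns NA for any pair that involves a missing value unless told otherwise. The sketch below compares na.omit() with the use argument of cor(), then draws a similar heatmap with base R's heatmap() instead of gplots. It uses the built-in airquality data set as a stand-in (an assumption), and the names dat, cor.listwise, cor.pairwise, and cor.raw are introduced only for this illustration.

```r
# Stand-in data: built-in airquality, which contains missing values.
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Listwise deletion, as in the text: drop every row that has any NA.
dat.complete <- na.omit(dat)
cor.listwise <- cor(dat.complete)

# Alternative: drop NAs pair by pair, which keeps more observations per pair.
cor.pairwise <- cor(dat, use = "pairwise.complete.obs")

# With neither step, pairs involving a column with NAs come back as NA.
cor.raw <- cor(dat)
print(round(cor.raw, 2))

# Base-R heatmap: Rowv = NA and Colv = NA suppress reordering and dendrograms,
# and scale = "none" keeps the raw correlation values.
heatmap(cor.listwise, Rowv = NA, Colv = NA, scale = "none", symm = TRUE,
        col = heat.colors(20), margins = c(8, 8))
```

Listwise deletion computes every correlation on the same set of rows, which keeps the matrix internally consistent. Pairwise deletion uses more of the data but each cell may rest on a different subset of observations.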

