Industrial Data Analytics for Diagnosis and Prognosis - Yong Chen - Page 23

Relationship Between A Numerical Variable and A Categorical Variable – Side-by-Side Box Plot

Side-by-side box plots show how the distribution of a numerical variable changes across the values of a categorical variable. The idea is to draw one box plot of the numerical variable for each value of the categorical variable and place them next to each other on a common scale. In Figure 2.5, we draw two side-by-side box plots for the auto_spec data set using the following R code:

# Put two plots in one row; keep the old settings so they can be restored
oldpar <- par(mfrow = c(1, 2))
boxplot(auto.spec.df$compression.ratio ~ auto.spec.df$fuel.type,
        xlab = "Fuel Type", ylab = "Compression Ratio")
# las = 2 rotates the axis labels so the body style names do not overlap
boxplot(auto.spec.df$highway.mpg ~ auto.spec.df$body.style,
        las = 2, xlab = "", ylab = "Highway MPG")
mtext("Body Style", side = 3, line = 1)
par(oldpar)

Figure 2.5 Side-by-side box plots.

The left panel of Figure 2.5 shows how the numerical variable compression.ratio is related to the two values (diesel and gas) of fuel.type. The side-by-side box plot makes it clear that diesel cars have much higher compression ratios than gas cars. This also explains the separate cluster of outliers observed in the histogram and box plot of compression.ratio in Figure 2.3: those outliers are the diesel cars. The right panel of Figure 2.5 shows how highway.mpg is related to the five values of body.style. Hatchbacks tend to have higher highway MPG, while convertibles tend to have lower highway MPG.
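The visual comparison above can also be checked numerically. The following is a minimal sketch, not part of the original text: it uses a small invented data frame, toy.df, as a stand-in for auto.spec.df (the values are made up for illustration and are not taken from the auto_spec data set). It computes the median of the numerical variable within each group and retrieves the statistics that boxplot() would draw.

```r
# Hypothetical toy data standing in for auto.spec.df; the values are
# invented for illustration only.
toy.df <- data.frame(
  fuel.type = c("diesel", "diesel", "diesel", "gas", "gas", "gas", "gas", "gas"),
  compression.ratio = c(21.0, 22.5, 23.0, 8.5, 9.0, 9.2, 9.4, 10.0)
)

# Median of the numerical variable within each value of the categorical variable
group.medians <- tapply(toy.df$compression.ratio, toy.df$fuel.type, median)
print(group.medians)

# With plot = FALSE, boxplot() draws nothing and returns the statistics it
# would plot. Each column of bp$stats corresponds to one group and holds the
# lower whisker, lower hinge, median, upper hinge, and upper whisker.
bp <- boxplot(compression.ratio ~ fuel.type, data = toy.df, plot = FALSE)
print(bp$names)
print(bp$stats)
```

Passing the data frame through the data argument, rather than repeating the data frame name before each $, is only a stylistic choice here; both forms produce the same grouping.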
