Industrial Data Analytics for Diagnosis and Prognosis - Yong Chen

Distribution of Numerical Variables – Histogram and Box Plot


A histogram approximates the distribution of a numerical variable with continuous values. It can be viewed as a bar chart extended to continuous numerical variables. To draw a histogram, the entire range of the variable in the data set is divided into a number of consecutive, equal-sized intervals. A “bar” is then drawn for each interval, with its height representing the number of observations that fall in that interval.
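The binning step can be made concrete with a minimal sketch in R. It does not use the book's data: the vector x is synthetic, and the bin width of 10 is an arbitrary choice for illustration. The sketch counts the observations in each equal-width interval by hand and then asks hist() for its counts over the same intervals.

```r
# Synthetic stand-in for a continuous variable (an assumption, not the book's data).
set.seed(1)
x <- rnorm(200, mean = 170, sd = 12)

# Consecutive equal-width intervals covering the whole range of x.
bin_width <- 10  # arbitrary width chosen for illustration
breaks <- seq(floor(min(x) / bin_width) * bin_width,
              ceiling(max(x) / bin_width) * bin_width,
              by = bin_width)

# Count the observations that fall in each interval; these are the bar heights.
counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
print(counts)

# hist() computes the counts for the same intervals (plot = FALSE skips drawing).
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

Both printouts should list the same counts, one per interval, and the counts sum to the number of observations.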

Another commonly used plot for representing the distribution of a numerical variable is the box plot. We illustrate the basic elements of a box plot in Figure 2.2, which shows the box plot of the numerical variable width of the auto_spec data set. The bold line within the rectangular box represents the median value of the variable in the data set. The lower and upper bounds of the box correspond to the first quartile (25th percentile) and the third quartile (75th percentile), respectively. The height of the box is the interquartile range (IQR), which is the distance between the first and third quartiles. The short horizontal lines above and below the box are called the whiskers; they represent the maximum and minimum of the values in the data set, excluding the “outliers”. In box plots, an outlier is typically defined as a data point that lies more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile. The individual outliers are shown as open circles in the box plot in Figure 2.2.
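The elements described above can also be computed numerically. The following is a minimal sketch on synthetic data (the vector x and the two appended extreme values are assumptions for illustration, not the width variable of the book's data set). It relies on boxplot.stats(), the base R function that returns the five numbers a box plot is drawn from. Note that R uses Tukey's hinges for the box bounds, which can differ slightly from the quartiles returned by quantile().

```r
# Synthetic stand-in for a numerical variable (an assumption, not the book's data),
# with two extreme values appended so that outliers exist.
set.seed(2)
x <- c(rnorm(100, mean = 66, sd = 2), 80, 50)

# Five numbers drawn by a box plot: lower whisker, lower hinge (about Q1),
# median, upper hinge (about Q3), upper whisker.
bs <- boxplot.stats(x, coef = 1.5)
lower_whisker <- bs$stats[1]
q1            <- bs$stats[2]
med           <- bs$stats[3]
q3            <- bs$stats[4]
upper_whisker <- bs$stats[5]

# Height of the box and the 1.5 * IQR limits for outliers.
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Outliers lie beyond the limits; whiskers are the extremes of the remaining points.
outliers     <- x[x < lower_fence | x > upper_fence]
non_outliers <- x[x >= lower_fence & x <= upper_fence]

print(c(Q1 = q1, median = med, Q3 = q3, IQR = iqr))
print(range(non_outliers))  # should match the two whiskers
print(outliers)             # should match bs$out
```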


Figure 2.2 Elements of a box plot.

The R functions hist() and boxplot() can be used to plot the histogram and box plot, respectively. The following R code plots, as shown in Figure 2.3, the histograms and box plots of three numerical variables, length, horsepower, and compression.ratio, in the auto_spec data set.


Figure 2.3 Histograms and box plots of three numerical variables.

oldpar <- par(mfrow = c(2, 3)) # split the plot into panels
hist(auto.spec.df$length, xlab = "Length",
     main = "Histogram of Length")
hist(auto.spec.df$horsepower, xlab = "Horsepower",
     main = "Histogram of Horsepower")
hist(auto.spec.df$compression.ratio, xlab = "Compression Ratio",
     main = "Histogram of Compression Ratio")
boxplot(auto.spec.df$length, ylab = "Length",
        main = "Boxplot of Length")
boxplot(auto.spec.df$horsepower, ylab = "Horsepower",
        main = "Boxplot of Horsepower")
boxplot(auto.spec.df$compression.ratio, ylab = "Compression Ratio",
        main = "Boxplot of Compression Ratio")
par(oldpar) # restore the previous graphical parameters

From the histogram and box plot of the variable length, it can be seen that the distribution of the car lengths in the data set is fairly symmetric. In contrast, the distribution of horsepower is right-skewed, with a long right tail. The histogram of the compression ratios shows two groups or clusters of data, which is also indicated by the separate cluster of outliers with high compression ratios in the box plot.
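These three shapes can also be checked numerically. The sketch below is one possible companion to the plots and uses synthetic vectors (sym, skew, and two are assumptions chosen to mimic a symmetric, a right-skewed, and a two-cluster variable; they are not the auto_spec columns). For a right-skewed variable the mean is pulled above the median by the long tail, and for a variable with a small, well-separated second cluster the box plot rule flags that cluster as outliers.

```r
# Synthetic variables mimicking the three shapes (assumptions, not the book's data).
set.seed(3)
sym  <- rnorm(500, mean = 175, sd = 12)             # roughly symmetric
skew <- rlnorm(500, meanlog = 4.5, sdlog = 0.8)     # long right tail
two  <- c(rnorm(180, mean = 9, sd = 0.5),           # main cluster
          rnorm(20, mean = 22, sd = 0.7))           # small high-valued cluster

# Number of points flagged by the 1.5 * IQR rule used in box plots.
n_out <- function(v) length(boxplot.stats(v)$out)

shape_summary <- data.frame(
  variable   = c("symmetric", "right-skewed", "two clusters"),
  mean       = c(mean(sym), mean(skew), mean(two)),
  median     = c(median(sym), median(skew), median(two)),
  n_outliers = c(n_out(sym), n_out(skew), n_out(two))
)
print(shape_summary)

# The high-valued cluster appears among the box plot outliers.
high_cluster_outliers <- sum(boxplot.stats(two)$out > 15)
print(high_cluster_outliers)
```

A mean close to the median is consistent with a symmetric shape, while a mean well above the median points to a right tail. These summaries complement the plots rather than replace them, since a histogram is still needed to see the two clusters directly.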
