Industrial Data Analytics for Diagnosis and Prognosis - Yong Chen
2 Introduction to Data Visualization and Characterization
Before making a move, an experienced chess player first surveys the positions of the pieces on the board for noticeable patterns, such as the opponent's threats, special relationships between pieces, and the strengths and weaknesses of both sides, and only then digs into in-depth calculation of move sequences to find the best move. Similarly, a data scientist should start by exploring the data set for noticeable patterns before conducting any in-depth analysis, such as building a sophisticated mathematical model or running a computationally intensive algorithm. Simple data exploration methods help us understand the basic data structure, such as the dimension and types of variables; discover initial patterns, such as relationships among variables; and identify missing values, outliers, and skewed distributions that call for data pre-processing and transformation. This chapter focuses on basic graphical and numerical methods for data description and exploration. We first look at a data set in the following example.
Example 2.1 (auto_spec data) The data set in auto_spec.csv, which is from the UCI Machine Learning Repository [Dua and Graff, 2017], contains the specifications of a sample of cars. The following R code can be used to read the data file and obtain information on the basic characteristics and structure of the data set.
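# load data; stringsAsFactors = TRUE is needed in R >= 4.0 so that
# text columns are read as factors and summary() reports level counts
auto.spec.df <- read.csv("auto_spec.csv", header = TRUE, stringsAsFactors = TRUE)
# show basic information of data set
dim(auto.spec.df)
names(auto.spec.df)
head(auto.spec.df)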
summary(auto.spec.df)

> dim(auto.spec.df)
[1] 205 23
> names(auto.spec.df)
 [1] "make"              "fuel.type"         "aspiration"
 [4] "num.of.doors"      "body.style"        "drive.wheels"
 [7] "engine.location"   "wheel.base"        "length"
[10] "width"             "height"            "curb.weight"
[13] "engine.type"       "num.of.cylinders"  "engine.size"
[16] "fuel.system"       "bore"              "stroke"
[19] "compression.ratio" "horsepower"        "peak.rpm"
[22] "city.mpg"          "highway.mpg"
> head(auto.spec.df)
        make fuel.type aspiration num.of.doors  body.style drive.wheels
1 Alfa-Romeo       Gas        Std          Two Convertible          Rwd
2 Alfa-Romeo       Gas        Std          Two Convertible          Rwd
3 Alfa-Romeo       Gas        Std          Two   Hatchback          Rwd
4       Audi       Gas        Std         Four       Sedan          Fwd
5       Audi       Gas        Std         Four       Sedan          Fwd
6       Audi       Gas        Std          Two       Sedan          Fwd
....
  horsepower peak.rpm city.mpg highway.mpg
1        111     5000       21          27
2        111     5000       21          27
3        154     5000       19          26
4        102     5500       24          30
5        115     5500       18          22
6        110     5500       19          25
> summary(auto.spec.df)
         make       fuel.type    aspiration   num.of.doors       body.style
 Toyota    : 32   Diesel: 20    Std  :168    Four:114       Convertible: 6
 Nissan    : 18   Gas   :185    Turbo: 37    Two : 89       Hardtop    : 8
 Mazda     : 17                              NA's:  2       Hatchback  :70
 Honda     : 13                                             Sedan      :96
 Mitsubishi: 13                                             Wagon      :25
 Subaru    : 12
 (Other)   :100
....
    city.mpg      highway.mpg
 Min.   :13.00   Min.   :16.00
 1st Qu.:19.00   1st Qu.:25.00
 Median :24.00   Median :30.00
 Mean   :25.22   Mean   :30.75
 3rd Qu.:30.00   3rd Qu.:34.00
 Max.   :49.00   Max.   :54.00
From the R output, we see that this data set contains 205 observations on 23 variables, including the manufacturer, fuel type, body style, dimensions, horsepower, miles per gallon, and other specifications of a car. In the statistics and data mining literature, an observation is also called a record, a data point, a case, a sample, an entity, an instance, or a subject. The variables associated with an observation are also called attributes, fields, characteristics, or features. The summary() function shows basic summary information for each variable, such as the mean, median, and range of values. From the summary information, it is clear that there are two types of variables. A variable such as fuel.type or body.style has a finite number of possible values, and there is no numerical relationship among the values. Such a variable is referred to as a categorical variable. On the other hand, a variable such as highway.mpg or horsepower has continuous numerical values and is referred to as a numerical variable. Beyond the basic data summary, graphical methods can be used to show more patterns of both types of variables, as discussed in the following subsection.
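The two variable types can also be told apart programmatically. The following is a minimal sketch on a small made-up data frame (toy.df and its values are hypothetical and are not taken from auto_spec.csv); it assumes that categorical variables are stored as factors.

```r
# toy data frame for illustration only (made-up values, not auto_spec.csv)
toy.df <- data.frame(
  fuel.type   = factor(c("gas", "diesel", "gas", "gas")),
  body.style  = factor(c("sedan", "sedan", "wagon", "hatchback")),
  horsepower  = c(111, 102, 154, 110),
  highway.mpg = c(27, 30, 26, 25)
)

# flag each column as numerical (TRUE) or not (FALSE)
is.num <- sapply(toy.df, is.numeric)

# split the variable names by type
num.vars <- names(toy.df)[is.num]
cat.vars <- names(toy.df)[!is.num]

num.vars
cat.vars

# number of distinct levels of each categorical variable
sapply(toy.df[cat.vars], nlevels)
```

The same calls should apply to auto.spec.df, provided its text columns were read in as factors.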
Note from the results of summary() that several variables in the auto_spec data set have missing values, which are represented by NA. Missing values are common in real-world data sets, and there are various ways to handle them. If the number of observations with missing values is small, those observations can simply be omitted. To do this, we can use the R function na.omit(). From the following R code we can see that 205 − 197 = 8 observations in this data set have missing values, so simply removing them is a reasonable way to handle the missing values here.
> dim(na.omit(auto.spec.df))
[1] 197 23
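Before dropping rows, it can help to see where the missing values are. The sketch below uses a small made-up data frame (toy.df is hypothetical, not the auto_spec data) to count missing values per variable and to confirm that the number of incomplete rows matches the number of rows removed by na.omit().

```r
# toy data frame with missing values (made-up values, for illustration only)
toy.df <- data.frame(
  num.of.doors = factor(c("four", NA, "two", "four", "two")),
  bore         = c(3.47, 3.19, NA, 3.19, 3.62),
  horsepower   = c(111, 102, 154, NA, 110)
)

# number of missing values in each variable
na.per.var <- colSums(is.na(toy.df))
na.per.var

# number of observations with at least one missing value
n.incomplete <- sum(!complete.cases(toy.df))
n.incomplete

# same count obtained as in the text: rows before minus rows after na.omit()
n.removed <- nrow(toy.df) - nrow(na.omit(toy.df))
n.removed
```

Note that a row with several missing values is still counted once, so the per-variable counts can add up to more than the number of incomplete rows.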
If a significant number of observations in a data set have missing values, an alternative to removing those observations is imputation, the process of replacing missing values with substituted values. A simple method of imputation is to replace missing values with the mean or median of the variable. More sophisticated procedures, such as regression-based imputation, also exist. These methods play important roles mainly in medical and scientific studies, where data collection from patients or subjects is often costly. In most industrial data analytics applications, where data are typically abundant, simpler methods of handling missing values are usually sufficient.
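Simple mean or median imputation can be sketched in a few lines of base R. The vector below is made up for illustration (it is not a column of auto_spec.csv), and the variable names are hypothetical.

```r
# made-up numerical variable with missing values, for illustration only
horsepower <- c(111, NA, 154, 102, NA, 110)

# median and mean of the observed (non-missing) values
hp.median <- median(horsepower, na.rm = TRUE)
hp.mean   <- mean(horsepower, na.rm = TRUE)

# replace each missing value with the median of the observed values
hp.median.imputed <- ifelse(is.na(horsepower), hp.median, horsepower)

# alternative: replace each missing value with the mean instead
hp.mean.imputed <- ifelse(is.na(horsepower), hp.mean, horsepower)

hp.median.imputed
hp.mean.imputed
```

The median is less sensitive to outliers than the mean, so it may be the safer default when the variable has a skewed distribution.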