Data Science For Dummies - Lillian Pierson
Detecting outliers with univariate analysis
Univariate outlier detection means inspecting the features in your dataset one at a time for anomalous values. You can choose from two simple methods for doing this:
Tukey outlier labeling
Tukey boxplotting
Tukey boxplotting is an exploratory data analysis technique that visualizes the distribution of a numeric variable by its quartiles. As you might guess, the Tukey boxplot was named after its inventor, John Tukey, an American mathematician who did most of his work back in the 1960s and 70s. Tukey outlier labeling refers to labeling data points that lie beyond the lower and upper extremes of a boxplot (the ends of its whiskers) as outliers.
It is cumbersome to use the Tukey method to manually calculate, identify, and label outliers, but if you want to do it, the trick is to look at how far the minimum and maximum values are from the 25th and 75th percentiles. The distance between the 1st quartile (Q1, at 25 percent) and the 3rd quartile (Q3, at 75 percent) is called the interquartile range (IQR), and it describes the data’s spread. When you look at a variable, consider its spread, its Q1 and Q3 values, and its minimum and maximum values to decide whether the variable is suspect for outliers.
Here’s a good rule of thumb:
a = Q1 - 1.5*IQR
and
b = Q3 + 1.5*IQR.
If your minimum value is less than a, or your maximum value is greater than b, the variable probably has outliers.
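The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not the book's own code: the function names tukey_fences and label_outliers and the sample values are hypothetical, and the quartiles are computed with the standard library's statistics.quantiles using method="inclusive", which is one of several common quartile conventions (other conventions can give slightly different fences).

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (a, b), the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # method="inclusive" is an assumption; other quartile conventions exist.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the spread of the middle half
    return q1 - k * iqr, q3 + k * iqr


def label_outliers(values, k=1.5):
    """Return the values that fall below a or above b."""
    a, b = tukey_fences(values, k)
    return [v for v in values if v < a or v > b]


# Hypothetical sample with one suspiciously large value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
a, b = tukey_fences(sample)

# The variable is suspect if its minimum is below a or its maximum is above b.
suspect = min(sample) < a or max(sample) > b
print("fences:", a, b)
print("suspect for outliers:", suspect)
print("labeled outliers:", label_outliers(sample))
```

For this sample, the maximum lies above b, so the variable is flagged, and 102 is the only value labeled as an outlier.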
On the other hand, it is quite easy to generate a Tukey boxplot and spot outliers using Python or R. Each boxplot has whiskers that extend at most 1.5*IQR beyond the box, which spans Q1 to Q3. Any values that lie beyond these whiskers are plotted individually as outliers. Figure 4-7 shows outliers as they appear within a Tukey boxplot that was generated in Python.
Credit: Python for Data Science Essential Training Part 1, LinkedIn.com
FIGURE 4-7: Spotting outliers with a Tukey boxplot.
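As a rough sketch of how such a plot might be produced in Python, the following uses matplotlib, which is an assumption here because the text does not say which library generated Figure 4-7. The sample values and the output file name tukey_boxplot.png are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical sample with one suspiciously large value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

fig, ax = plt.subplots()

# whis=1.5 limits each whisker to 1.5*IQR beyond the box (Q1 to Q3);
# points past the whiskers are drawn individually as "fliers".
result = ax.boxplot(sample, whis=1.5)
ax.set_ylabel("value")

# The plotted outlier points can also be read back from the returned artists.
fliers = [float(v) for v in result["fliers"][0].get_ydata()]
print("points drawn beyond the whiskers:", fliers)

fig.savefig("tukey_boxplot.png")  # hypothetical output file name
```

Reading the fliers back from the returned dictionary gives the same labeled points that appear as isolated markers on the plot.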