Outlier Detection
Quantitative error detection often targets data anomalies with respect to some definition of “normal” data values. While an exact definition of an outlier depends on the application, there are some commonly used definitions, such as “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980] and “an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett and Lewis 1994]. Different outlier detection methods will generate different “candidate” outliers, possibly along with some confidence scores.
Many applications of outlier detection exist. In the context of computer networks, different kinds of data, such as operating system calls and network traffic, are collected in large volumes. Outlier detection can help with detecting possible intrusions and malicious activities in the collected data. In the context of credit card fraud, unauthorized users often exhibit unusual spending patterns, such as a buying spree from a distant location. Fraud detection refers to the detection of criminal activities in such financial transactions. Last, in the case of medical data records, such as MRI scans, PET scans, and ECG time series, automatic identification of abnormal patterns or records in these data often signals the presence of a disease and can help with early diagnosis.
There are many challenges in detecting outliers. First, defining normal data patterns for a normative standard is challenging. For example, when data is generated from wearable devices, spikes in recorded readings might be considered normal if they are within a specific range dictated by the sensors used. On the other hand, spikes in salary values are probably interesting outliers when analyzing employee data. Therefore, understanding the assumptions and limitations of each outlier detection method is essential for choosing the right tool for a given application domain. Second, many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the “curse of dimensionality.” As the number of dimensions increases, it becomes increasingly difficult to accurately estimate the multidimensional distribution of the data points [Scott 2008], and the distances between points approach zero and become meaningless [Beyer et al. 1999]. We give some concrete examples and more details on these challenges in the next section.
In Section 2.1, we present a taxonomy of outlier detection techniques. We discuss in detail each of these categories in Sections 2.2, 2.3, and 2.4, respectively. In Section 2.5, we discuss outlier detection techniques for high-dimensional data that address the “curse of dimensionality.”
2.1 A Taxonomy of Outlier Detection Methods
Outlier detection techniques mainly differ in how they define normal behavior. Figure 2.1 depicts the taxonomy we adopt to classify outlier detection techniques, which can be divided into three main categories: statistics-based outlier detection techniques, distance-based outlier detection techniques, and model-based outlier detection techniques [Aggarwal 2013, Chandola et al. 2009, Hodge and Austin 2004]. In this section, we give an overview of each category along with its pros and cons; each category is then discussed in detail in the following sections.
Statistics-Based Outlier Detection Methods. Statistics-based outlier detection techniques assume that the normal data points would appear in high probability regions of a stochastic model, while outliers would occur in the low probability regions of a stochastic model [Chandola et al. 2009]. There are two commonly used categories of approaches for statistics-based outlier detection. The first category is based on hypothesis testing methods, such as the Grubbs Test [Grubbs 1969] and the Tietjen-Moore Test [Tietjen and Moore 1972]; they usually calculate a test statistic, based on observed data points, which is used to determine whether the null hypothesis (there is no outlier in the dataset) should be rejected. The second category of statistics-based outlier detection techniques aims at fitting a distribution or inferring a probability density function (pdf) based on the observed data. Data points that have low probability according to the pdf are declared to be outliers. Techniques for fitting a distribution can be further divided into parametric approaches and non-parametric approaches. Parametric approaches for fitting a distribution assume that the data follows an underlying distribution and aim at finding the parameters of the distribution from the observed data. For example, assuming the data follows a normal distribution, parametric approaches would need to learn the mean and variance for the normal distribution. In contrast, nonparametric approaches make no assumption about the distribution that generates the data; instead, they infer the distribution from the data itself.
Figure 2.1 A taxonomy of outlier detection techniques.
There are advantages of statistics-based techniques.
1. If the underlying data follows a specific distribution, then the statistical outlier detection techniques can provide a statistical interpretation for discovered outliers.
2. Statistical techniques usually provide a score or a confidence interval for every data point, rather than making a binary decision. The score can be used as additional information while making a decision for a test data point.
3. Statistical techniques usually operate in an unsupervised fashion without any need for labeled training data.
There are also some disadvantages of statistics-based techniques.
1. Statistical techniques usually rely on the assumption that the data is generated from a particular distribution. This assumption often does not hold true, especially for high-dimensional real datasets.
2. Even when the statistical assumption can be reasonably justified, there are several hypothesis test statistics that can be applied to detect anomalies; choosing the best statistic is often not a straightforward task. In particular, constructing hypothesis tests for complex distributions that are required to fit high-dimensional datasets is nontrivial.
Distance-Based Outlier Detection Methods. Distance-based outlier detection techniques often define a distance between data points that is used for defining a normal behavior. For example, a normal data point should be close to many other data points, and data points that deviate from such normal behavior are declared outliers [Knorr and Ng 1998, 1999, Breunig et al. 2000]. Distance-based outlier detection methods can be further divided into global or local methods depending on the reference population used when determining whether a point is an outlier. A global distance-based outlier detection method determines whether a point is an outlier based on the distance between that data point and all other data points in the dataset. On the other hand, a local method considers the distance between a point and its neighborhood points when determining outliers. There are advantages of distance-based techniques.
1. A major advantage of distance-based techniques is that they are unsupervised in nature and do not make any assumptions regarding the generative distribution for the data. Instead, they are purely data driven.
2. Adapting distance-based techniques to different data types is straightforward, and primarily requires defining an appropriate distance measure for the given data.
There are disadvantages of distance-based techniques.
1. If the data has normal instances that do not have enough close neighbors, or if the data has anomalies that have enough close data points, distance-based techniques will fail to label them correctly.
2. The computational complexity of the testing phase is also a significant challenge since it involves computing the distance of every pair of data points.
3. Performance of a nearest neighbor-based technique greatly relies on a distance measure, defined between a pair of data instances, which can effectively distinguish between normal and anomalous instances. Defining distance measures between instances can be challenging when the data is complex, for example, graphs, sequences, and so on.
Model-Based Outlier Detection Methods. Model-based outlier detection techniques first learn a classifier model from a set of labeled data points, and then apply the trained classifier to a test data point to determine whether it is an outlier. Model-based approaches assume that a classifier can be trained to distinguish between the normal data points and the anomalous data points using the given feature space. They label data points as outliers if none of the learned models classify them as normal points. Based on the labels available to train the classifier, model-based approaches can be further divided into two subcategories: multi-class model-based techniques and one-class model-based techniques. Multi-class model-based techniques assume that the training data points contain labeled instances belonging to multiple normal classes. On the other hand, one-class model-based techniques assume that all the training data points belong to one normal class. The advantages of model-based techniques include:
1. Model-based techniques, especially the multi-class techniques, can make use of powerful algorithms that can distinguish between instances belonging to different classes.
2. The testing phase of model-based techniques is fast, since each test instance needs to be compared against the precomputed model.
The major disadvantage of model-based techniques is that they must rely on the availability of accurate labels for the various normal classes, which is often not possible.
2.2 Statistics-Based Outlier Detection
In this section, we discuss statistics-based outlier detection methods. First, we review some basics about data distributions in Section 2.2.1. Section 2.2.2 then shows how hypothesis tests, such as Grubbs’ test, are used for detecting outliers. In Section 2.2.3, we present parametric approaches for fitting a distribution to the data, and also introduce the notion of robust statistics, which can capture important properties of the underlying distribution in the presence of outliers in the data. In Section 2.2.4, we introduce nonparametric approaches for fitting a distribution to the data by using histograms and kernel density estimation (KDE).
2.2.1 A Note on Data Distribution
Database practitioners usually think of data as a collection of records. They are interested in descriptive statistics about the data (e.g., “what is the min, max, and average salary for all the employees?”), and they usually do not care much about how the data is generated. On the other hand, statisticians usually treat data as a sample of some data-generating process. In many cases, they try to find a model or a distribution of that process that best describes the available sample data.
The most commonly used distribution is the Gaussian or normal distribution, which is characterized by a mean μ and a standard deviation σ [Barnett and Lewis 1994]. While the normal distribution might not be the exact distribution of a given dataset, it is the cornerstone of modern statistics, and it accurately models the sum of many independently and identically distributed variables according to the central limit theorem. The following equation shows the pdf of the normal distribution N(μ, σ²):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Multivariate Gaussian distribution or multivariate normal distribution is a generalization of the univariate normal distribution to higher dimensions. A random vector consisting of d random variables Xi, where i ∊ [1, d], is said to be d-variate normally distributed if every linear combination of its d components has a univariate normal distribution. The mean of the d variables is a d-dimensional vector μ = (μ1, μ2, …, μd)T, where μi is the mean of the random variable Xi. The covariance matrix Σ of these d random variables (also known as dispersion matrix or variance-covariance matrix) is a matrix, where Σi,j (Row i and Column j of Σ) is the covariance between the two random variables Xi and Xj, namely, Σi,j = cov(Xi, Xj) = E[(Xi − μi)(Xj − μj)]. Intuitively, the covariance matrix generalizes the notion of variance of a single random variable or the covariance of two random variables to d dimensions.
2.2.2 Hypothesis Testing for Outlier Detection
A statistical hypothesis is an assumption about a population that may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses, which usually consist of the following steps [Lehmann and Romano 2006]: (1) formulate the null and the alternative hypothesis; (2) decide which test is appropriate and state the relevant test statistic T; (3) choose a significance level α, namely, a probability threshold below which the null hypothesis will be rejected; (4) determine the critical region for the test statistic T, namely, the range of values of the test statistic for which the null hypothesis is rejected; (5) compute from the current data the observed value tobs of the test statistic T; and (6) reject the null hypothesis if tobs falls within the critical region. Different known test statistics exist for testing different properties of a population. For example, the chi-square test for independence is used to determine whether there is a significant relationship between two categorical variables, the one-sample t-test is commonly used to determine whether the hypothesized mean differs significantly from the observed sample mean, and the two-sample t-test is used to determine whether the difference between two means found in two samples is significantly different from the hypothesized difference between the two means.
A number of formal hypothesis tests for detecting outliers have been proposed in the literature. These tests usually differ according to the following characteristics.
• What is the underlying distribution model for the data?
• Is the test designed for testing a single outlier or multiple outliers?
• If the test is for detecting multiple outliers, does the number of outliers need to be specified exactly in the test?
For example, the Grubbs Test [Grubbs 1969] is used to detect a single outlier in a univariate dataset that follows an approximately normal distribution. The Tietjen-Moore Test [Tietjen and Moore 1972] is a generalization of the Grubbs test to the case of more than one outlier; however, it requires the number of outliers to be specified exactly. The Generalized Extreme Studentized Deviate (ESD) Test [Rosner 1983] requires only an upper bound on the suspected number of outliers and is often used when the exact number of outliers is unknown.
We use Grubbs’ test as an example to show how hypothesis testing is used for detecting outliers, as it is a standard test for detecting a single outlier. If an outlier is detected, it is expunged from the dataset and the test is repeated until no further outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for small sample sizes. Grubbs’ test is defined for the following hypotheses, where H0 is the null hypothesis and Ha is the alternative hypothesis:
H0 : there are no outliers in the dataset
Ha : there is exactly one outlier in the dataset.
Grubbs’ test statistic is defined as $G = \frac{\max_{i=1,\ldots,N} |Y_i - \bar{Y}|}{s}$, with $\bar{Y}$ and $s$ denoting the sample mean and sample standard deviation, respectively. Grubbs’ test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation. The above test statistic is the two-sided version of the test. Grubbs’ test can also be used as a one-sided test for determining whether the minimum value or the maximum value is an outlier. To test whether the minimum value Ymin is an outlier, the test statistic is defined as $G = \frac{\bar{Y} - Y_{\min}}{s}$; similarly, to test whether the maximum value Ymax is an outlier, the test statistic is defined as $G = \frac{Y_{\max} - \bar{Y}}{s}$. For the two-sided test, the null hypothesis of no outliers is rejected if $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}$, with $t_{\alpha/(2N),\,N-2}$ denoting the upper critical value of the t-distribution with N − 2 degrees of freedom and a significance level of α/(2N). For a one-sided test, replace α/(2N) with α/N.
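For concreteness, the following is a minimal sketch of the two-sided Grubbs’ test in Python, assuming NumPy and SciPy are available; the function name and default significance level are illustrative rather than part of the original text.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test: returns (is_outlier, index, G, G_critical)."""
    y = np.asarray(values, dtype=float)
    n = len(y)
    mean, std = y.mean(), y.std(ddof=1)          # sample mean and standard deviation
    deviations = np.abs(y - mean)
    idx = int(np.argmax(deviations))             # most deviant point
    G = deviations[idx] / std                    # Grubbs' test statistic
    # Upper critical value of the t-distribution at significance alpha/(2N), N-2 d.o.f.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_critical = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return G > G_critical, idx, G, G_critical
```

To detect multiple outliers, the function can be applied repeatedly, removing the flagged value after each rejection, as illustrated in Example 2.1 below.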
Table 2.1 Employee records, including name, age, income, and tax
Example 2.1 Consider Table 2.1, including the name, age, income, and tax of employees. Domain knowledge suggests that the first value t1[age] and the last value t9[age] are outlying age values. We use Grubbs’ test with a significance level α = 0.05 to identify outliers in the age column.
The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. Grubbs’ test statistic G is obtained at t9, the value with the largest absolute deviation from the mean. With a significance level α = 0.05, the critical value is computed as Gcritical = 2.21. Since G > Gcritical, the null hypothesis is rejected. Therefore, t9[age] is reported as an outlier.
Removing t9[age], we are left with 8 age values. The mean of the 8 age values is 28.88, and the standard deviation of the 8 age values is 12.62. Grubbs’ test statistic G is now obtained at t1. With a significance level α = 0.05, the critical value is computed as Gcritical = 2.12. Since G > Gcritical, the null hypothesis is rejected. Therefore, t1[age] is reported as an outlier.
Removing t1[age], we are left with 7 age values. The mean of the 7 age values is 32.86, and the standard deviation of the 7 age values is 6.15. Grubbs’ test statistic G is now obtained at t8. With a significance level α = 0.05, the critical value is computed as Gcritical = 2.01. Since G < Gcritical, the null hypothesis is not rejected, and no further outliers are reported.
The previous discussion assumes that the data follows an approximately normal distribution. To assess this assumption, several graphical techniques can be used, including the Normal Probability Plot, the Run Sequence Plot, the Histogram, the Box Plot, and the Lag Plot.
Iglewicz and Hoaglin provide an extensive discussion of the outlier tests given above [Iglewicz and Hoaglin 1993]. Barnett and Lewis [1994] provide a book-length treatment of the subject, including additional tests for data that is not normally distributed.
2.2.3 Fitting Distribution: Parametric Approaches
The other type of statistics-based approach first fits a statistical distribution to describe the normal behavior of the given data points, and then applies a statistical inference procedure to determine if a certain data point belongs to the learned model. Data points that have a low probability according to the learned statistical model are declared as anomalous outliers. In this section, we discuss parametric approaches for fitting a distribution to the data.
Univariate
We first consider univariate outlier detection, for example, for a set of values x1, x2, …, xn that appear in one column of a relational table. Assuming the data follows a normal distribution, fitting the values under a normal distribution essentially means computing the mean μ and the standard deviation σ from the current data points x1, x2, …, xn. Given μ and σ, a simple way to identify outliers is to compute a z-score for every xi, which is defined as the number of standard deviations away xi is from the mean, namely, z-score(xi) = |xi − μ| / σ. Data values that have a z-score greater than a threshold, for example, three, are declared to be outliers.
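A minimal sketch of this z-score rule, assuming NumPy and a threshold of three standard deviations (both illustrative choices):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)      # fit a normal distribution to the column
    z = np.abs(x - mu) / sigma               # z-score of each value
    return np.where(z > threshold)[0]        # indices of suspected outliers
```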
Since there might be outliers among x1, x2, … xn, the estimated μ and σ might be far off from their actual values, resulting in missing outliers in the data, as we show in Example 2.2.
Example 2.2 Consider again the age column in Table 2.1. The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. The procedure that identifies values that are more than 2 standard deviations away from the mean as outliers would mark values that are not in the range of [136.78 − 2 * 323.92, 136.78 + 2 * 323.92] = [–511.06, 784.62]. The last value t9[age] is not in the range, and thus is correctly marked as an outlier. The first value t1[age], however, is in the range and is thus missed.
This effect is called masking [Hellerstein 2008]; that is, a single data point has severely shifted the mean and standard deviation so much as to mask other outliers. To mitigate the effect of masking, robust statistics are often employed, which can correctly capture important properties of the underlying distribution even in the face of many outliers in the data values. Intuitively, the breakdown point of an estimator is the proportion of incorrect data values (e.g., arbitrarily large or small values) an estimator can tolerate before giving an incorrect estimate. The mean and standard deviation have the lowest breakdown point: a single bad value can distort the mean completely.
Robust Univariate Statistics. We now introduce two robust statistics: the median and the median absolute deviation (MAD) that can replace mean and standard deviation, respectively. The median of a set of n data points is the data point for which half of the data points are smaller, and half are larger; in the case of an even number of data points, the median is the average of the middle two data points. The median, also known as the 50th percentile, is of critical importance in robust statistics with a breakdown point of 50%; as long as no more than half the data are outliers, the median will not give an arbitrarily bad result. The median absolute deviation (MAD) is defined as the median of the absolute deviations from the data’s median, namely, MAD = mediani(|xi − medianj (xj)|). Similar to the median, MAD is a more robust statistic than the standard deviation. In the calculation of the standard deviation, the distances from xi to the mean are squared, so large deviations, which often are caused by outliers, are weighted heavily, while in the calculation of MAD, the deviations of a small number of outliers are irrelevant because MAD is using the median of the absolute deviations.
The median and MAD lead to a robust outlier detection technique known as Hampel X84 [Hampel et al. 2011] that is quite reliable in the face of outliers since it has a breakdown point of 50%. Hampel X84 marks outliers as those data points that are more than 1.4826θ MADs away from the median, where θ is the number of standard deviations away from the mean one would have used if there were no outliers in the dataset. The constant 1.4826 is derived under a normal distribution, where one standard deviation away from the mean is about 1.4826 MADs.
Example 2.3 For detecting outliers in the age column in Table 2.1, we used 2 standard deviations away from the mean in Example 2.2. Therefore, we would like to flag points that are more than 1.4826 * 2 = 2.9652 MADs away from the median as outliers.
The median of the set of values in Example 2.2 is 32 and the MAD is 7. The normal value range would be [32 − 7 * 2.9652, 32 + 7 * 2.9652] = [11.2436, 52.7564], which is a much more reasonable range than [−511.06, 784.62] derived using mean and standard deviation. Under the new normal range, the first value and the last value are now correctly flagged as outliers.
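A minimal sketch of the Hampel X84 rule described above, assuming NumPy; the function name and the default θ are illustrative:

```python
import numpy as np

def hampel_x84(values, theta=2.0):
    """Flag points more than 1.4826 * theta MADs away from the median."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))          # median absolute deviation (MAD)
    cutoff = 1.4826 * theta * mad
    return np.where(np.abs(x - med) > cutoff)[0]
```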
Multivariate
So far, we have considered detecting univariate outliers in one dimension by fitting the data to a normal distribution. However, some outliers can only be revealed when considering multiple dimensions; we call these multivariate outliers.
Example 2.4 Consider the income and tax columns in Table 2.1. In general, income varies for different persons, and so do tax rates. However, one would expect a positive correlation between income and tax rate, namely, a higher income is usually associated with a higher tax rate. Therefore, a record that has a high income but a low tax rate should be flagged as an outlier, even though the income value or the tax value in that record might not be an outlier when considered separately in its respective column. Consider t5 in Table 2.1. t5[income] is not an outlier compared with the values in the income column, and t5[tax] is also not an outlier compared with the values in the tax column. However, taking the income and tax columns into account simultaneously, t5 is an outlier, since it has a relatively low tax for a relatively high income value.
Suppose we have a table that has n rows and d columns; we model each column as a random variable Xi, ∀i ∈ [1, d]. Let the mean of the d variables be a d-dimensional vector μ = (μ1, μ2, …, μd)T and the covariance matrix be Σ, where Σi,j is the covariance between the two random variables Xi and Xj. Given the mean and the covariance matrix, the questions are how to measure the distance from a particular data point to the mean using the covariance matrix, and how much distance should qualify a point as an outlier. The Mahalanobis distance [Mahalanobis 1936] is the multidimensional generalization of measuring how many standard deviations away a point is from the mean of the population. Specifically, the Mahalanobis distance of a point x to the mean vector μ is defined as $\text{Mahalanobis}(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$. Notice that if Σ is the identity matrix, then the Mahalanobis distance reduces to the standard Euclidean distance between x and μ. If every dimension is a normal distribution, then the square of the Mahalanobis distance follows a χ² distribution with d degrees of freedom. Using the cumulative distribution function of the χ² distribution with d degrees of freedom, we define the threshold of the Mahalanobis distance to be a number such that most of the Mahalanobis distances (e.g., 95%) are smaller than that threshold.
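A minimal sketch of Mahalanobis-distance-based multivariate outlier detection, assuming NumPy and SciPy; the 95% χ² cutoff is an illustrative choice:

```python
import numpy as np
from scipy import stats

def mahalanobis_outliers(X, quantile=0.95):
    """X is an n x d array; returns indices of rows flagged as outliers."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # squared Mahalanobis distance of every row
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    threshold = stats.chi2.ppf(quantile, df=X.shape[1])   # chi-square cutoff
    return np.where(d2 > threshold)[0]
```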
Robust Multivariate Statistics. Similar to the univariate mean and variance, the multivariate mean and covariance matrix are not robust, i.e., a single bad data point might severely distort the mean and the covariance matrix, and thus robust estimators must be used. The most famous method for the robustification of multivariate mean and covariance matrix is called the Minimum Covariance Determinant (MCD) by Rousseeuw and Driessen [1999]. MCD chooses a subset of h points that minimizes the determinant of the covariance matrix over all possible subsets of size h (h is usually chosen to reflect the belief about how many points in the dataset are outliers). For example, assume 10% of n points are outliers, then h is chosen to be 0.9n. The multivariate mean and covariance matrix of the h points are used to compute the Mahalanobis distance for all n points. A brute force way to find the covariance matrix with the smallest determinant is to enumerate all possible subsets of size h from the n data points, which is obviously very expensive as one has to evaluate all the possible subsets. A greedy algorithm called FASTMCD [Rousseeuw and Driessen 1999] tries to solve the efficiency problem. Algorithm 2.1 describes FASTMCD. It starts by randomly selecting a subset of h data points, and the mean μ and the covariance matrix Σ are computed using the random sample. Then, the Mahalanobis distances for all points in I are computed using the μ and Σ. The random sample is updated using h points with the smallest Mahalanobis distances. The sample is updated in iterations until it remains unchanged between two iterations. The above process is repeated by using a different random sample as the initial seed, and the best result is used.
Algorithm 2.1 FASTMCD
The correctness of the algorithm can be proven by showing that det(H1) ≤ det (H0) in the iterative portion. The mean and the covariance matrix computed using FASTMCD can be shown to have a 50% breakdown point [Rousseeuw and Driessen 1999].
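The listing of Algorithm 2.1 is not reproduced here. The following is a hedged sketch of the iteration described above (often called the C-step), not the authors’ exact pseudocode; the helper names, the number of random restarts, and the iteration cap are illustrative assumptions.

```python
import numpy as np

def c_step(X, subset_idx):
    """One iteration: refit mean/covariance on the current subset and
    re-select the h points with the smallest Mahalanobis distances."""
    h = len(subset_idx)
    S = X[subset_idx]
    mu = S.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(S, rowvar=False))
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis distances
    return np.argsort(d2)[:h]                            # new subset of size h

def fast_mcd_estimate(X, h, n_starts=10, max_iter=100, seed=0):
    """Approximate MCD: robust mean and covariance from the best of several
    randomly seeded runs (smallest covariance determinant)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best_det, best_subset = np.inf, None
    for _ in range(n_starts):                        # different random initial seeds
        subset = rng.choice(len(X), size=h, replace=False)
        for _ in range(max_iter):
            new_subset = c_step(X, subset)
            if set(new_subset) == set(subset):       # unchanged between two iterations
                break
            subset = new_subset
        det = np.linalg.det(np.cov(X[subset], rowvar=False))
        if det < best_det:
            best_det, best_subset = det, subset
    S = X[best_subset]
    return S.mean(axis=0), np.cov(S, rowvar=False)   # robust mean and covariance matrix
```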
Most of our discussion of parametric approaches has been based on normal distributions. In practice, not all datasets are normally distributed. Two other distributions are frequently observed: multimodal distributions and Zipfian distributions. In some cases, data appears to have many “peaks”; such distributions are typically referred to as multimodal. Oftentimes, these multimodal distributions can be seen as superimposed normal distributions, known as a mixture of Gaussians. In other cases, a large fraction of the data is condensed into a small fraction of values, while the remainder of the data values are spread across a long tail of rare values. The Zipfian distribution exhibits such a phenomenon. It is important to choose the right distribution when detecting outliers using parametric approaches.
2.2.4 Fitting Distribution: Nonparametric Approaches
An obvious drawback of using parametric approaches for fitting a distribution is that they assume the data to follow an underlying distribution. In contrast, non-parametric approaches make no assumption about the distribution that generates the data; instead, they infer the distribution from the data itself. There are mainly two types of techniques in this category: histogram-based techniques and kernel density-based techniques.
Histograms
A histogram [Pearson 1894] is a graphical representation of the distribution of numerical data values, and is often used to estimate the probability distribution of a continuous random variable. The first step toward creating a histogram is to discretize or bin the range of values by dividing the range of the random variable into a series of intervals, which are usually consecutive and non-overlapping. The second step is to count how many values fall into each bin. If the bins are of equal width, each bin is drawn with a height proportional to its frequency, i.e., the number of values it contains; this type of histogram is called an equi-width histogram. In general, however, bins can be of varying width. If the widths of the bins are chosen such that every bin has the same frequency, the histogram is called an equi-depth histogram.
Equi-width histograms can be used to detect outliers: data points that belong to bins with very low frequency are reported as outliers. A major challenge with histogram-based outlier detection approaches is that it is often very difficult to come up with the “right” width of the interval. If the bins are too narrow, some bins will have low frequency and the normal data points belonging to those bins will be reported as outliers; if the bins are too wide, outlier data points might get absorbed in bins that also contain normal data points, and the real outliers will be missed. The problem is further exacerbated when adapting histograms to multivariate data, since the frequency of every division diminishes as the number of dimensions increases, as mentioned earlier in the context of the curse of dimensionality.
Figure 2.2 Equi-width histograms for outlier detection.
Example 2.5 Figure 2.2 shows two equi-width histograms we built for the age column in Table 2.1 using free statistics software [Wessa 2012]. We report data points residing in bins that contain less than three data points as outliers.
In Figure 2.2(a), the age values are split into 10 equal-width bins, namely (0, 100], (100, 200], …, (900, 1000]. Only the first bin and the last bin contain data points. Specifically, bin (0, 100] contains 8 points and bin (900, 1000] contains 1 point. Therefore, we report t9[age] ∈ (900, 1000] as an outlier.
In Figure 2.2(b), the age values are split into 50 equal-width bins, namely (0, 20], (20, 40], …, (980, 1000]. Only the first three bins and the last bin contain data points. Specifically, bin (0, 20] contains 1 point, bin (20, 40] contains 6 points, bin (40, 60] contains 6 points, and bin (980, 1000] contains 1 point. Therefore, we report t1[age] ∈ (0, 20], t8[age] ∈ (40, 60], and t9[age] ∈ (980, 1000] as outliers.
There are usually two ways to generalize a histogram to deal with multivariate data: (1) compute an outlier score for each dimension using a one-dimensional histogram, and then combine the scores to obtain an overall outlier score for every data point; or (2) bin every dimension and combine the bins to divide the multidimensional space into grid cells; the number of data points belonging to each cell is counted, and the data points that belong to cells with very low frequency counts are reported as outliers. The performance of histogram methods usually deteriorates with an increasing number of dimensions due to the sparsity of the grid structure with increasing dimensionality [Aggarwal 2013]. For example, given a 10-dimensional space with each dimension split into 10 bins, the grid will contain 10^10 grid cells. It would require 10^10 available data points to allow every grid cell to have one data point in expectation.
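A minimal sketch of the univariate equi-width histogram approach described above, assuming NumPy; the number of bins and the frequency cutoff are illustrative parameters:

```python
import numpy as np

def histogram_outliers(values, bins=10, min_count=3):
    """Report values that fall into bins containing fewer than min_count points."""
    x = np.asarray(values, dtype=float)
    counts, edges = np.histogram(x, bins=bins)       # equi-width binning
    bin_idx = np.digitize(x, edges[:-1]) - 1         # bin index of each value, in [0, bins-1]
    return np.where(counts[bin_idx] < min_count)[0]  # indices of points in low-frequency bins
```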
Kernel Density Estimation
KDE [Rosenblatt 1956, Parzen 1962, Gan and Bailis 2017] is a nonparametric way to estimate the pdf f(x) of a random variable X based on a sample of n data points, just as histograms are, but the estimate can be endowed with properties such as smoothness or continuity by using a suitable kernel.
Let x1, x2, …, xn be an independent and identically distributed sample drawn from some distribution with an unknown pdf f(x). KDE estimates f(x) using the estimator $\hat{f}_h(x)$ defined as follows:

$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where K(•) is the kernel, and h is a smoothing parameter called the bandwidth. A kernel has to be a non-negative function that integrates to one and has mean zero, namely, $\int K(x)\,dx = 1$ and K(x) = K(−x). A kernel with a subscript h is a scaled kernel that satisfies $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$. Some commonly used kernel functions include the normal, uniform, triangular, biweight, triweight, and Epanechnikov kernels. Just as the choice of the bin width influences the result of histograms, the bandwidth of the kernel is a free parameter that strongly influences the KDE as well.
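A minimal sketch of KDE-based outlier scoring using SciPy’s Gaussian-kernel estimator (scipy.stats.gaussian_kde); flagging the lowest 5% of densities is an illustrative choice, not part of the original text:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_outliers(values, quantile=0.05, bandwidth=None):
    """Flag points whose estimated density falls in the lowest `quantile`."""
    x = np.asarray(values, dtype=float)
    kde = gaussian_kde(x, bw_method=bandwidth)   # bandwidth h (None = rule-of-thumb choice)
    density = kde(x)                             # estimated pdf evaluated at each point
    cutoff = np.quantile(density, quantile)
    return np.where(density <= cutoff)[0]
```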
Example 2.6 To compare KDEs with histograms, consider 9 data points x1 = 1, x2 = 1, x3 = 2, x4 = 2, x5 = 2, x6 = 4, x7 = 6, x8 = 9, and x9 = 10. Figure 2.3 compares the pdf derived using both the equi-width histogram with a bin width equal to 2 and three KDEs using different kernel functions and bandwidths. As we can see, KDEs in general produce much smoother estimates than the histogram. Comparing Figures 2.3(b), 2.3(c), and 2.3(d), we can see that bandwidth clearly controls the smoothness of the pdf produced. Given the estimated pdf, any points that have low probability are flagged as outliers.
The generalization of multivariate KDE from univariate KDE is similar to the generalization of multivariate histograms from univariate histograms, and is discussed in detail in Simonoff [2012].
Figure 2.3 Compare histograms with KDEs. Graphs were produced using Wessa [2012].
2.3 Distance-Based Outlier Detection
In contrast to statistics-based outlier detection techniques, distance-based outlier detection techniques do not assume an underlying model that generates the data and are often nonparametric. These approaches often define a distance between data points that is used to define normal behavior; for example, normal data points should be close to many other data points, and any data point that deviates from such normal behavior is declared an outlier [Knorr and Ng 1998, 1999, Breunig et al. 2000].
Distance-based outlier detection methods can be further divided into global or local methods depending on the reference population used when determining whether a point is an outlier. A global distance-based outlier detection method determines whether a point is an outlier based on the distance between that data point and all other data points in the dataset. On the other hand, a local method considers the distance between a point and its neighborhood points when determining outliers.
2.3.1 Global Distance-Based Outlier Detection
Distance-based methods were first introduced by Knorr and Ng [Knorr and Ng 1998, 1999]. An object O in a dataset I is a distance-based outlier, i.e., a DB(p, D) outlier, if at least a fraction p of the objects in I lies at a distance greater than D from O. DB(p, D) can be seen as a generalization of some existing statistics-based outlier detection definitions. For example, let I be observations from a univariate normal distribution N(μ, δ) and O be a point from I; then the z-score of O is greater than 3 if and only if O is a DB(0.9988, 0.13δ) outlier [Knorr and Ng 1998]. Knorr and Ng proved that statistics-based outlier detection definitions based on other underlying distributions, such as the Student t-distribution and the Poisson distribution, are also equivalent to DB(p, D) outliers with suitable choices of p and D.
The definition of DB(p, D) does not provide a ranking or a score of outliers and requires specifying a distance parameter D, which could be difficult to determine and may involve a trial-and-error process. There are other distance-based outlier definitions that look at the k nearest neighbors of a data point. Kollios et al. [2003] define an object O as an outlier in a dataset I if at most k objects in I lie at distance at most D from O. Ramaswamy et al. [2000] define outliers as the top few data elements whose distance to the kth nearest neighbor is greatest.
The simplest category of algorithms [Knorr and Ng 1998, 1999, Ramaswamy et al. 2000] for finding distance-based outliers uses nested loops to compute the distance between every point and every other point, which has quadratic complexity. Another category of algorithms uses spatial index structures such as KD-trees, R-trees, or X-trees to find the nearest neighbors of each point [Knorr and Ng 1998, 1999, Ramaswamy et al. 2000]. This category of algorithms works well for low-dimensional datasets, and has a complexity of O(n log n) if the index tree allows an O(log n) lookup for finding a point’s nearest neighbor. However, the index structures become increasingly complex as the dimensionality increases. For example, Breunig et al. [2000] used an X-tree variant to do a nearest neighbor search; they found that the index performed well for fewer than 5 dimensions, and the performance dropped dramatically for just 10–20 dimensions. A third category of algorithms partitions the space into regions to enable a faster search of nearest neighbors. For each region, certain summary statistics such as the minimum bounding rectangle are stored. During the search procedure for finding nearest neighbors of a point, the point is compared with the summary statistics of a region to determine whether it is possible to skip all the points in that region. Knorr and Ng [1998] found that the partitioning-based approach has a complexity linear in n, but exponential in the number of dimensions.
An efficient algorithm for finding distance-based outliers was devised by Bay and Schwabacher [2003]; it is an adaptation of the simple nested loop algorithm and has an expected running time of O(n). Algorithm 2.2 describes the procedure. The distance function between two points x and y is denoted as distance(x, y), such as Euclidean distance for numerical values and Hamming distance for categorical values. The function score(x, Y) denotes an outlier score based on the nearest-neighbor distances between x and its current neighbor set Y, and is monotonically decreasing as closer neighbors are found.
Algorithm 2.2 partitions all the points I into a set of blocks I1, …, Ib, and compares each point in every block to every point in I. It keeps track of the top m outliers and the weakest outlier of the m outliers, namely, the point with the smallest score. The main idea in the nested loop is to keep track of the closest k neighbors found so far for each point x. When a point x achieves a score lower than the cutoff c, we can safely remove the point x since x can no longer be an outlier. As the loop continues, the algorithm finds more outliers, and the cutoff threshold c increases along with the pruning power. In the worst case, the algorithm still has a complexity of O(n2); however, the average case analysis reveals the algorithm has an expected complexity of O(n) [Bay and Schwabacher 2003].
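Algorithm 2.2 itself is listed later in this chapter. As a hedged sketch (not the authors’ exact pseudocode), the following illustrates the nested-loop idea with cutoff-based pruning, using the distance to the kth nearest neighbor as the outlier score and processing points one at a time rather than in blocks:

```python
import numpy as np

def knn_distance_outliers(X, k=5, m=10):
    """Return indices of the m points with the largest kth-nearest-neighbor
    distance, pruning points that cannot beat the current cutoff."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    top = []          # (score, index) pairs for the current top-m outliers
    cutoff = 0.0      # score of the weakest outlier found so far
    for i in range(n):
        neighbors = np.full(k, np.inf)   # k smallest distances seen so far for point i
        pruned = False
        for j in range(n):
            if i == j:
                continue
            d = np.linalg.norm(X[i] - X[j])
            if d < neighbors.max():
                neighbors[np.argmax(neighbors)] = d
            # the partial score only decreases; once it drops below the cutoff,
            # point i can no longer be a top-m outlier and can be skipped
            if len(top) == m and neighbors.max() < cutoff:
                pruned = True
                break
        if not pruned:
            top.append((neighbors.max(), i))
            top.sort(reverse=True)
            top = top[:m]
            cutoff = top[-1][0]          # weakest of the current top-m outliers
    return [idx for _, idx in top]
```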
2.3.2 Local Distance-Based Outlier Detection
Distance-based outlier detection methods such as DB(p, D) and its variants may miss certain kinds of outliers because every point is compared with every other point in the dataset.
Example 2.7 Consider a two-dimensional dataset with data points shown in Figure 2.4. There are two clusters of points, C1 and C2; points in C1 are sparse, whereas points in C2 are quite dense. Obviously, points o1 and o2 should be reported as outliers. However, distance-based methods would not flag o2 as an outlier before flagging points in C1, since o2 is much closer to the majority of the data points, i.e., those in C2, than the points in C1 are to their own neighbors.
The local outlier factor (LOF) method [Breunig et al. 2000] scores data points based on the density of their neighboring data points, and can capture outlier points such as o2 in Figure 2.4.
Given a positive integer k, the k-distance of an object p, denoted as k-distance(p), is defined as the distance distance(p, o) between p and another object o, such that: (1) there exist at least k objects other than p in the dataset whose distance to p is less than or equal to distance(p, o); and (2) there exist at most k − 1 objects other than p in the dataset whose distance to p is strictly less than distance(p, o). Given k-distance(p), the k-distance neighborhood of p, denoted as Nk(p), contains all objects other than p whose distance to p is less than or equal to k-distance(p). The reachability distance of an object p with respect to object o is defined as reach-distk(p, o) = max {k-distance(o), d(p, o)}. Intuitively, if an object p is too far away from o, then reach-distk(p, o) is the actual distance between p and o. On the other hand, if p and o are sufficiently close, then reach-distk(p, o) is replaced by the k-distance of o. In this way, the distance fluctuations of all objects close to o are reduced; the higher the k, the greater the smoothing effect.
Algorithm 2.2 Randomized algorithm for finding distance-based outliers
Definition 2.1 The local reachability density of p is defined as:

$$lrd_k(p) = 1 \Big/ \left(\frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|}\right)$$
Figure 2.4 An example showing distance-based outlier detection failure [Breunig et al. 2000].
The local outlier factor of p is defined as:

$$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|}$$
Intuitively, the local reachability density of an object p is the inverse of the average reachability distance based on the k nearest neighbors of p. The local outlier factor of p is the average of the ratio of the local reachability density of the points in the k-distance neighborhood of p to the local reachability density of p. It has been shown [Breunig et al. 2000] that the LOF definition has many desirable properties; for example, LOFk(p) is approximately equal to 1 in a cluster (a region with homogeneous density around the point and its neighbors), and LOFk(p) ≫ 1 if p is an outlier.
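A minimal sketch of the LOF computation following the definitions above, assuming NumPy; it computes everything by brute force and ignores ties in the k-distance:

```python
import numpy as np

def lof_scores(X, k=5):
    """Compute LOF_k for every point in the n x d array X (brute force)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(dist, np.inf)                                  # exclude self
    knn = np.argsort(dist, axis=1)[:, :k]                 # k nearest neighbors of each point
    k_dist = dist[np.arange(n), knn[:, -1]]               # k-distance of each point
    # reachability distance: reach-dist_k(p, o) = max(k-distance(o), distance(p, o))
    reach = np.maximum(k_dist[knn], dist[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                        # local reachability density
    lof = (lrd[knn] / lrd[:, None]).mean(axis=1)          # average lrd ratio over neighbors
    return lof
```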
There are many proposed variants of LOF. Jin et al. [2001] mine only for the top outliers; by deriving bounds on the LOF scores, they avoid computing the LOF scores for all data points. Tang et al. [2002] propose the connectivity-based outlier factor (COF), since LOF may have difficulty detecting outliers in low-density regions. Jin et al. [2006] take the symmetric neighborhood into account by considering both the k-nearest neighborhood of p and its reverse k-nearest neighborhood. Papadimitriou et al. [2003] aim to eliminate the need to choose the parameter k by considering all possible values of k.
2.4 Model-Based Outlier Detection
Model-based outlier detection techniques first learn a classifier model from a set of labeled data points, and then apply the trained classifier to a test data point to determine whether it is an outlier. Model-based approaches assume that a classifier can be trained to distinguish between the normal data points and the anomalous data points using the given feature space.
Based on the labels available to train the classifier, model-based approaches can be further divided into two subcategories: multi-class model-based techniques and one-class model-based techniques. Multi-class model-based techniques assume that the training data points contain labeled instances belonging to multiple normal classes [De Stefano et al. 2000]. Multiple classifiers are trained, where each classifier is supposed to distinguish between one of the normal classes and the rest of the classes. A test data point is declared as an outlier if it is not classified as belonging to any of the normal classes by any of the classifiers. In contrast, one-class model-based techniques assume that all the training data points belong to one normal class. Such techniques learn a discriminative boundary to distinguish between all the normal data points and the abnormal data points using a one-class classification algorithm, such as one-class support vector machines (SVMs) [Schölkopf et al. 2001, Ratsch et al. 2002]. Any test data point that does not fall within the learned boundary is declared an outlier.
Example 2.8 Figure 2.5(a) shows an example of a multi-class anomaly detection case where there are three normal classes. Three data points that do not belong to any of the three normal classes are flagged as outliers.
Figure 2.5(b) shows an example case of one-class anomaly detection. The classifier learns a boundary for the normal data points. Data points that lie outside the boundary are flagged as outliers.
Figure 2.5 Model-based outlier detection techniques [Chandola et al. 2009].
Given that there are many different classification algorithms, such as neural networks, Bayesian networks, SVMs, and decision trees, a variety of model-based outlier detection techniques have been developed. Basic neural networks have been used as a multi-class model-based outlier detection method [Taylor and Addison 2000, De Stefano et al. 2000]. Variants of the basic neural networks have been proposed that use different types of neural networks, such as replicator neural networks [Hawkins et al. 2002]. SVMs have been used as one-class model-based outlier detection techniques. Such techniques learn a region that contains the training data points [Ratsch et al. 2002]. Different kernels, such as Gaussian kernels and radial basis function kernels, can be used to learn complex regions. A test data point is declared as an outlier if it falls outside the boundary. Different variants of the basic SVMs have been proposed for outlier detection in audio signal data [Davy and Godsill 2002] and in temporal data [Ma and Perkins 2003]. Robust support vector machines have also been proposed to detect the boundary in the presence of outliers in the training data [Song et al. 2002]. Rule-based models such as decision trees have also been used for outlier detection [Joshi et al. 2001, Fan et al. 2004]. Each learned rule has an associated confidence value that is based on the number of training normal data points correctly classified by the rule and the total number of training normal data points covered by the rule. If a test point is not captured by any of the learned rules, it is declared an outlier.
2.5 Outlier Detection in High-Dimensional Data
Real datasets can be high dimensional; some may contain hundreds or even thousands of attributes (e.g., the large number of sensor readings in an airplane). Many outlier detection techniques lose their effectiveness when the number of dimensions (attributes) of the dataset is large; this effect is commonly known as the “curse of dimensionality.” For statistics-based outlier detection approaches, it becomes increasingly difficult to accurately estimate the multidimensional distribution of the data points [Scott 2008] as the number of dimensions increases. For distance-based approaches, the distances between data points approach zero and become meaningless [Beyer et al. 1999] with increasing dimensionality.
One popular category of techniques for dealing with high-dimensional data uses dimensionality reduction, which refers to the process of finding a low-dimensional representation of a high-dimensional dataset that preserves certain properties of the dataset, such as distances or similarities between data points. This topic has been well studied and surveyed in statistics, ML, and data mining [Cunningham and Ghahramani 2015, Fodor 2002, Ding et al. 2008]. Consider a set of m real-valued d-dimensional data points represented as an m × d matrix X, where each row corresponds to a data point xi ∈ ℝd. The goal of dimensionality reduction is to devise a transformation function T : ℝd → ℝk that maps each data point xi to a new data point in a k-dimensional space (k < d), such that the transformed data preserves some properties of interest. We denote by x′i = T(xi) the transformed data point of xi and by X′ the transformed dataset. One of the simplest methods to perform dimensionality reduction is through random projections [Achlioptas 2001]. To form a new data point x′i from xi, k random d-dimensional vectors are generated to form a matrix T ∈ ℝk×d. The new dataset can be computed as X′ = XT^T. It is shown that random projections preserve pairwise distances between vectors [Achlioptas 2001].
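A minimal sketch of dimensionality reduction via random projection, assuming NumPy; Gaussian entries scaled by 1/√k are one common choice (Achlioptas [2001] also analyzes sparse ±1 entries), and the function name is illustrative:

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project an m x d dataset X down to k dimensions with a random matrix T."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # k random d-dimensional vectors; the 1/sqrt(k) scaling keeps expected
    # squared distances approximately unchanged
    T = rng.normal(size=(k, d)) / np.sqrt(k)
    return X @ T.T                             # transformed m x k dataset X' = X T^T
```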
Principal Component Analysis (PCA) [Pearson 1901, Hotelling 1933] is perhaps the most popular statistical procedure to perform dimensionality reduction. PCA uses orthogonal transformation to convert a set of data points of possibly correlated dimensions into a set of data points of linearly uncorrelated dimensions, called principal components. The transformation is done in such a way that the first component accounts for the largest possible variance in the data, and each succeeding component has the largest variance possible given that it is orthogonal to all preceding components. PCA can be done by eigenvalue decomposition of the covariance matrix or singular value decomposition (SVD) [Shlens 2014]. To reduce the complexity of computing PCA via SVD, progressive sampling strategies are used [Suri and Bailis 2017].
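A minimal sketch of PCA via SVD, assuming NumPy; retaining the top k components is the caller’s choice:

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal component scores of the m x d dataset X."""
    Xc = X - X.mean(axis=0)                     # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                         # top-k principal directions (rows of Vt)
    return Xc @ components.T                    # projected (m x k) data
```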
Dimensionality reduction techniques use all of the available dimensions of the original dataset and aim to find new datasets that preserve certain properties. In what follows, we focus on two additional categories of approaches that discover outliers using a subset of dimensions from the original dataset: subspace outlier detection techniques and contextual outlier detection techniques. Subspace outlier detection techniques select one or more subsets of attributes from all attributes and find outliers with respect to every selected subset of attributes, namely, a subspace [Aggarwal and Yu 2001, Zhang et al. 2004, Muller et al. 2008, Kriegel et al. 2009]. Contextual outlier detection techniques select from all attributes one or more pairs of subsets of attributes, where each pair of subsets consists of a subset of environmental attributes (also referred to as contextual attributes) and a subset of behavioral attributes (also referred to as indicator attributes or metric attributes). The environmental attributes are used to select subsets of tuples from all tuples, where each subset of tuples is referred to as a context. Outliers are detected within each context with respect to the behavioral attributes [Wei et al. 2003, Zhu et al. 2004, Angiulli et al. 2009, Tang et al. 2013, Liang and Parthasarathy 2016].
Example 2.9 Consider again the employee records in Table 2.1. We know from Example 2.4 that t5 is a multivariate outlier with respect to the income and tax attributes. Example 2.4 assumes that the income and tax attributes are given as input. However, in reality, we are often only given a table with many attributes, and we have no knowledge of which subsets of attributes would reveal interesting outliers.
Given Table 2.1, a subspace outlier detection algorithm would first identify the income and the tax attributes as a subspace, and then discover t5 as an outlier in that subspace.
A contextual outlier detection algorithm would identify income to be the environmental attribute and tax to be the behavioral attribute. t5 is then reported as a contextual outlier with respect to the tax attribute within the context income > 1000; in other words, among the tuples with income > 1000, t5 has an abnormal tax.
In Example 2.9, the same outlier could be detected by both subspace outlier detection techniques and contextual outlier detection techniques. Subspace outlier detection techniques usually need to enumerate all potentially interesting subspaces. Contextual outlier detection techniques not only need to enumerate possible environmental attributes and behavioral attributes, but also need to enumerate contexts based on the environmental attributes. Contextual outliers are generally more interpretable than subspace outliers (e.g., t5 in Example 2.9).
We first lay out two commonly found use cases for high-dimensional outlier detections in Section 2.5.1. We then discuss in detail subspace outlier detection techniques in Section 2.5.2 and contextual outlier detection techniques in Section 2.5.3.
2.5.1 Two Use Cases for High-Dimensional Outlier Detection
Techniques for detecting outliers in high-dimensional data often find two general use cases, regardless of whether they are looking for subspace outliers or contextual outliers.
Use Case 1: High-dimensional outlier detection for data exploration. One of the main characteristics of big data exploration tasks, in contrast to querying, is the fact that analysts do not necessarily know what they are looking for. Traditional outlier detection techniques are limited since they require users to specify the interesting attributes. Consider an analyst performing market research who wishes to determine which companies are unusually profitable. Since companies from different sectors may not be comparable in profits, the analyst might answer this query by running traditional outlier detection queries multiple times, each time on a subset of companies belonging to a specific sector (e.g., technology and media) to identify the most profitable companies within each sector. There are potentially a large number of interesting subgroups, of which technology and media companies are only two possible subgroups. In addition, other grouping criteria (e.g., based on location) might also reveal outliers. Instead of relying on analysts to come up with subspaces and contexts, analysts need tools that can discover subspaces and contexts which contain outliers. In the previous example, while Company X might have “normal” reported profit compared to all technology companies (with normal defined, for example, as being within two standard deviations from the mean), it may have very unusual (outlying) profit when compared to companies with less than 10,000 employees. In this case, if a tool could produce [Business = “technology” ∧ EmployeeCount < 10,000], this may be an automatically discovered context of interest to the analyst.
Use Case 2: High-dimensional outlier detection for targeted explanation and diagnosis. Analysts often would like to answer the question “What is so special about this entity or record?”, a key question in explaining analytics results or diagnosing reported errors or anomalies. For example, an analyst performing troubleshooting at a call center may wish to understand why an important customer in industrial manufacturing is calling the troubleshooting hotline. Using conventional tools, the analyst might check whether a few of the customer’s key performance metrics (e.g., quality of service) are abnormal. However, this approach is brittle; for example, high-value clients may naturally exhibit deviations from the overall population. Instead, high-dimensional outlier analysis can take other dimensions into account along with the performance metric dimensions to reveal the subspaces or the contexts in which the client is meaningfully outlying. For example, a tool might discover that the client is experiencing an unusually high rate of poor-quality service compared to users in the same location with the same hardware make and model.
Problem formulations for high-dimensional outlier detection generally fall into one of the two use cases, as we show in the following when we discuss in detail subspace outlier detection techniques and contextual outlier detection techniques.
2.5.2 Subspace Outliers
To better appreciate the challenges in detecting subspace outliers, consider the four different two-dimensional views of a hypothetical dataset in Figure 2.6 as an example [Aggarwal 2013]. We see that Point A is considered an outlier in View 1 and Point B is considered an outlier in View 4, whereas neither point is considered an outlier in View 2 and View 3. The example shows that different data points may be considered outliers in different subspaces. For datasets with high dimensionality, it is very likely that only a small fraction of all possible subspaces contains interesting outliers.
Figure 2.6 The outliers may be present or be lost in different subspaces of the full dimensions [Aggarwal 2013].
Detecting outliers in subspaces is a challenging problem mainly due to the fact that the number of possible subspaces is exponential in the number of dimensions of the data. It is worth noting that there have been many proposals for finding clusters in subspaces [Parsons et al. 2004], and many techniques exploit the downward closure property (a cluster is dense in subspace AB only if it is dense in subspace A and in subspace B). Finding outliers in subspaces is even more challenging than finding clusters in subspaces. This is because outliers, by definition, are rare events, in contrast to clusters, which are dense regions; hence, the downward closure property is usually not applicable to detecting outliers in subspaces. An effective outlier detection method needs to search the subspaces in such a way that the outliers are revealed as quickly as possible. In what follows, we present in detail a first algorithm [Aggarwal and Yu 2001] that achieves this goal, and we briefly discuss other proposals [Zhang et al. 2004, Muller et al. 2008, Kriegel et al. 2009].
Aggarwal and Yu [2001] discover outliers in subspaces by finding localized regions of the data in subspaces with abnormally low density. Data points contained in low-density regions are identified as outliers. To determine abnormally low-density regions in subspaces, a grid-based approach is used to perform a discretization of the data. Each dimension of the data is divided into ϕ equi-depth ranges, and hence each range contains a fraction f = 1/ϕ of the records. Ranges from different dimensions are combined to form localized regions in subspaces. Consider a k-dimensional region that is created by picking grid ranges from k different dimensions. The expected fraction of records in that region is f^k, assuming that the attributes are statistically independent of each other. Of course, the attributes are usually far from statistically independent of each other, and thus the distribution of points in such a region can differ significantly from the expected behavior. It is precisely those regions with abnormally low density that are useful for identifying outliers.
Formally, under the independence assumption, the presence of any data point in a k-dimensional region is a Bernoulli random variable with probability f^k. Therefore, the expected number and the standard deviation of the number of data points in a k-dimensional region are N · f^k and √(N · f^k · (1 − f^k)), respectively, where N is the total number of data points. Let n(D) be the actual number of data points in a k-dimensional region D. The sparsity coefficient S(D) of D is defined as follows:
S(D) = (n(D) − N · f^k) / √(N · f^k · (1 − f^k)).
Assuming n(D) fits a normal distribution, the normal distribution tables can be used to quantify the probabilistic level of significance of its deviation. Aggarwal and Yu [2001] aim to find regions with low S(D).
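To make the sparsity coefficient concrete, the following minimal Python sketch computes S(D) from the observed count n(D), the dataset size N, the per-range fraction f = 1/ϕ, and the region dimensionality k. The function name and parameters are our own illustration rather than code from Aggarwal and Yu [2001]; a strongly negative value indicates a region that is far sparser than the independence assumption predicts.

import numpy as np

def sparsity_coefficient(n_D, N, f, k):
    # n_D: observed number of points in region D
    # N:   total number of data points
    # f:   fraction of points per equi-depth range (f = 1 / phi)
    # k:   number of dimensions defining the region
    expected = N * f**k                        # expected count under independence
    std = np.sqrt(N * f**k * (1 - f**k))       # standard deviation of the count
    return (n_D - expected) / std

# Example: N = 10,000 points, phi = 10 ranges per dimension, k = 3 dimensions.
# The expected count per 3-dimensional cell is 10,000 * (1/10)^3 = 10; a cell
# holding only 1 point gets S(D) of roughly -2.85, flagging it as sparse.
phi = 10
print(sparsity_coefficient(n_D=1, N=10_000, f=1 / phi, k=3))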
To avoid searching the exponentially large number of regions and computing S(D) for each region in a brute-force manner, evolutionary algorithms are used [Aggarwal and Yu 2001] to select regions with low S(D). In evolutionary methods, every candidate solution to an optimization problem is treated as an individual in an evolutionary system. The “fitness” of an individual is exactly the objective function value of the corresponding solution. An evolutionary algorithm uses mechanisms inspired by biological evolution, including mutation, recombination, and selection. Accordingly, the candidate solutions to an optimization problem are mutated, recombined, and selected according to the objective function until a convergence criterion is met. Every candidate solution is usually encoded as a string. Algorithm 2.3 describes the procedure for detecting outliers in a high-dimensional dataset using the evolutionary method. For a d-dimensional dataset, the encoding string is of length d and contains k specified positions, where k is a user-provided input specifying the number of dimensions of interest. The fitness of the corresponding solution is computed using the sparsity coefficient S(D). The evolutionary search procedure starts with a population of p randomly selected solutions and iteratively applies selection, crossover, and mutation until a convergence criterion is met (e.g., a maximum number of iterations). The selection process ranks the current solutions according to S(D) and selects higher-ranked solutions with higher probability; the crossover process uses a two-point crossover procedure. At the end of the evolutionary algorithm, all data points contained in the final solutions are reported as outliers.
Algorithm 2.3 Evolutionary algorithm for finding outliers in high-dimensional data
Example 2.10 Consider a dataset with d = 5 dimensions in total and ϕ = 10 ranges for each dimension, and suppose we are interested in subspaces with k = 3 dimensions. Every candidate solution is encoded as a string of length 5, where each position indicates which range, if any, is selected for the corresponding dimension (a ‘*’ means the dimension is not used). For instance, the candidate solution encoded as 3*2*1 represents a region defined on three dimensions, where the first dimension takes the third range, the third dimension takes the second range, and the fifth dimension takes the first range.
Performing a two-point crossover between the two solutions 3*2*1 and 1*33* after the third position results in two new solutions, 3*23* and 1*3*1; the mutation process randomly changes a selected position to a number between 1 and ϕ with a predefined mutation probability.
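The following Python sketch illustrates the overall structure of such an evolutionary search. It is not a reproduction of Algorithm 2.3: we use a simplified single cut-point recombination, do not repair children to keep exactly k selected positions, and all function names and default parameters are our own. It is meant only to show how the encoding strings, the sparsity coefficient as a fitness function, and the selection/crossover/mutation loop fit together.

import numpy as np

rng = np.random.default_rng(0)

def equi_depth_grid(X, phi):
    # Map each value to its equi-depth range index 1..phi, per dimension.
    N = X.shape[0]
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    return ranks * phi // N + 1

def random_solution(d, k, phi):
    # Encoding string of length d with exactly k chosen ranges ('*' = dimension unused).
    sol = ['*'] * d
    for dim in rng.choice(d, size=k, replace=False):
        sol[dim] = int(rng.integers(1, phi + 1))
    return sol

def fitness(sol, grid, N, phi):
    # Sparsity coefficient S(D) of the region encoded by sol (lower = sparser).
    k = sum(g != '*' for g in sol)
    if k == 0:
        return 0.0
    f = 1.0 / phi
    mask = np.ones(N, dtype=bool)
    for dim, g in enumerate(sol):
        if g != '*':
            mask &= grid[:, dim] == g
    return (mask.sum() - N * f**k) / np.sqrt(N * f**k * (1 - f**k))

def crossover(a, b):
    # Simplified recombination with a single cut point.
    cut = int(rng.integers(1, len(a)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(sol, phi, p=0.1):
    # Randomly change a selected position to a range between 1 and phi.
    return [int(rng.integers(1, phi + 1)) if g != '*' and rng.random() < p else g
            for g in sol]

def evolutionary_outlier_search(X, phi=10, k=3, pop=20, iters=50, keep=5):
    N, d = X.shape
    grid = equi_depth_grid(X, phi)
    population = [random_solution(d, k, phi) for _ in range(pop)]
    for _ in range(iters):
        population.sort(key=lambda s: fitness(s, grid, N, phi))
        parents = population[: pop // 2]                 # rank-based selection
        children = []
        while len(children) < pop - len(parents):
            i, j = rng.choice(len(parents), size=2, replace=False)
            c1, c2 = crossover(parents[i], parents[j])
            children += [mutate(c1, phi), mutate(c2, phi)]
        population = parents + children[: pop - len(parents)]
    population.sort(key=lambda s: fitness(s, grid, N, phi))
    return population[:keep]   # sparsest regions found; points inside them are candidate outliers

In the published algorithm, the selection probabilities, the two-point crossover, and the repair of children so that they retain exactly k selected dimensions are handled more carefully; see Aggarwal and Yu [2001] for the details.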
Algorithm 2.3 is an example of the first use case (cf. Section 2.5.1). OutRank [Muller et al. 2008] is another example of the first use case, which relies on subspace clustering approaches to find dense clusters in subspaces [Assent et al. 2007, Assent et al. 2008] and treats data points that are not in those dense clusters as outliers. Clusters are frequent objects (as opposed to outliers), and hence are amenable to the downward closure property. The outlyingness of an object is calculated based on how often the object is part of dense clusters, the dimensionality of the dense clusters, and the size of the dense clusters. In OutRank, outliers are side-products of a subspace clustering algorithm, which may not be satisfactory. For example, in a dataset where no subspace clusters are present, all objects may be reported as subspace outliers.
Instead of searching for relevant subspaces first and then finding outliers within those subspaces, HOS-Miner [Zhang et al. 2004] and SOD [Kriegel et al. 2009] are designed to find the subspaces in which a given data point is an outlier. For a data point x, HOS-Miner aims at finding all subspaces such that the sum of its k-nearest neighbor distances in that subspace is at least δ. This approach does not normalize the distances by the number of dimensions, and hence a subspace with more dimensions is more likely to be in the output. One advantage of HOS-Miner is that the outlier definition it adopts exhibits the downward closure property—any subspace of a non-outlying subspace is also non-outlying, and every superset of an outlying subspace is also outlying. Only minimal subspaces, in which the given data point x is an outlier, are interesting. HOS-Miner uses an X-Tree to perform k-nearest neighbor searches efficiently. HOS-Miner also uses the fixed threshold δ to discern outliers in all subspaces, which can be problematic since these scores are not comparable across different subspaces. SOD [Kriegel et al. 2009] also aims at finding subspaces in which a given data point x is an outlier. Given the data point x, a set of reference data points S(x) is determined, which captures the proximity (neighborhood) of x. Based on S(x), the relevant subspace for S(x) is determined as the set of dimensions in which the variance among the points in S(x) is small. If the query point x deviates considerably from the points in S(x) in the relevant subspace, then x is said to be an outlier in that subspace.
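As an illustration of the outlyingness test that HOS-Miner applies to a candidate subspace, the short sketch below (our own simplification; the actual system uses an X-Tree index and prunes subspaces via the downward closure property rather than scanning the data) computes the sum of a query point's k-nearest neighbor distances restricted to a given set of dimensions and compares it against the threshold δ.

import numpy as np

def knn_distance_sum(X, x, subspace, k=5):
    # Distances from query x to all points, measured only on the chosen dimensions.
    diffs = X[:, subspace] - x[subspace]
    dists = np.sort(np.sqrt((diffs ** 2).sum(axis=1)))
    # Skip the zero distance to x itself (assuming x is a row of X).
    return dists[1 : k + 1].sum()

def is_outlying_subspace(X, x, subspace, k=5, delta=10.0):
    # x is flagged as an outlier in `subspace` if its kNN distance sum is at least delta.
    return knn_distance_sum(X, x, subspace, k) >= delta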
While subspace outlier detection seems a viable approach for detecting outliers in high-dimensional data, it still faces many challenges: (1) In spite of recent advances, the combinatorial nature of the problem requires more efficient search procedures over the possible subspaces. (2) Almost all current techniques adopt their own definitions of outliers in subspaces, mainly for the purpose of effective enumeration or search of potential subspaces. For example, the sparsity coefficient is used as the fitness function in the evolutionary algorithm [Aggarwal and Yu 2001], and HOS-Miner [Zhang et al. 2004] is able to exploit the downward closure property only because of its particular definition of subspace outliers. How to efficiently search possible subspaces under a general definition of multivariate outliers remains an open question. (3) A point might be considered an outlier in multiple subspaces. Therefore, methods are needed to combine results from these (incomparable) subspaces and to rank subspace outliers effectively. This can also be seen as an opportunity, since results from multiple subspaces may indicate more robust outliers.
2.5.3 Contextual Outlier Detection
The notion of contextual outliers was first studied in the context of time-series data [Weigend et al. 1995, Salvador et al. 2004] and spatial data [Shekhar et al. 2001, Kou et al. 2006], where the contextual attribute is the time dimension or the spatial dimension. For example, a certain temperature might not be considered high when compared against readings from the entire year, but would be considered high for the winter months. Recently, many proposals have been made for discovering contextual outliers with general environmental attributes; some target the data exploration use case [Wei et al. 2003, Song et al. 2007, Tang et al. 2013, Liang and Parthasarathy 2016] (cf. Section 2.5.1), whereas others target the explanation and diagnosis use case [Angiulli et al. 2009, Zhu et al. 2004]. Among the proposals targeting the data exploration use case, some assume that the environmental attributes and the behavioral attributes are given as input [Song et al. 2007, Liang and Parthasarathy 2016], while others explore the space of possible environmental and behavioral attributes to identify contextual outliers [Wei et al. 2003, Tang et al. 2013]. The goal of the proposals targeting the explanation and diagnosis use case is to discover the environmental attributes and the behavioral attributes in which a given object is an outlier. In what follows, we give an example algorithm for the case when the environmental attributes and the behavioral attributes are given [Song et al. 2007], and an example algorithm for the case when they are not [Wei et al. 2003].
Song et al. [2007] learn a correlation between the provided environmental attributes and the provided behavioral attributes, and define contextual outliers as those data points that deviate from the learned correlation. The correlation is modeled using a Gaussian mixture model, and the parameters of the model are learned using expectation-maximization methods. Example 2.11 shows an application where the correlation between the environmental attributes and the behavioral attributes can be used to detect interesting contextual outliers.
Figure 2.7 Conditional anomaly detection given the environmental attribute max_daily_temp and the behavioral attribute num_fever [Song et al. 2007].
Example 2.11 Consider the application of detecting a disease outbreak using a dataset with two dimensions: max_daily_temp, which denotes the maximum outside temperature on a given day, and num_fever, which denotes how many people are admitted to an emergency room due to high fever on a given day. Clearly, max_daily_temp is not a direct indicator of disease outbreak; however, it should be taken into account as an environmental attribute when detecting outliers, as max_daily_temp directly affects num_fever.
A dataset is shown in Figure 2.7. Point A is considered an outlier by most outlier detection methods if we consider num_fever alone, since it has an abnormally high value for num_fever. However, Point A is not interesting for monitoring disease outbreaks, since num_fever is expected to be high on a cold day, which is the case for Point A.
Point B would not usually be considered an outlier if we consider num_fever alone. However, it is an interesting outlier for monitoring disease outbreaks, since it has an abnormally high num_fever given its max_daily_temp.
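The conditional anomaly detection model of Song et al. [2007] is more elaborate than what we can show here (it couples separate Gaussian mixtures over the environmental and behavioral spaces through a learned mapping). The sketch below, whose function name and component count are our own choices, captures the same intuition for the scenario of Example 2.11: it fits a joint two-dimensional Gaussian mixture with scikit-learn and scores each day by the conditional density of num_fever given max_daily_temp, so that days like Point B receive low scores.

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def conditional_scores(env, behav, n_components=3):
    # env, behav: 1-D arrays holding the environmental attribute (max_daily_temp)
    # and the behavioral attribute (num_fever) of each record.
    X = np.column_stack([env, behav])
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=0).fit(X)
    joint = np.zeros(len(X))
    marginal = np.zeros(len(X))
    for pi, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        mu_e, mu_b = mu
        var_e, var_b, cov_eb = cov[0, 0], cov[1, 1], cov[0, 1]
        # Weight of this component given the environmental value alone.
        w = pi * norm.pdf(env, mu_e, np.sqrt(var_e))
        # Conditional Gaussian of the behavioral attribute given the environmental one.
        mu_b_given_e = mu_b + cov_eb / var_e * (env - mu_e)
        var_b_given_e = var_b - cov_eb ** 2 / var_e
        joint += w * norm.pdf(behav, mu_b_given_e, np.sqrt(var_b_given_e))
        marginal += w
    return joint / marginal   # approximates p(num_fever | max_daily_temp)

# Records whose score falls below, e.g., the 1st percentile are candidate contextual
# outliers: a num_fever value that is unusual for that day's temperature.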
The HOT algorithm proposed by Wei et al. [2003] and the COD algorithm proposed by Tang et al. [2013] explore contextual outliers on categorical datasets, where the environmental attributes and the behavioral attributes are not given. Both approaches consist of two steps: context enumeration and detection of contextual outliers within each enumerated context. Contexts in both approaches are essentially conjunctions of predicates, namely, attribute-value pairs, and they are enumerated using a lattice data structure. An example candidate context is A = a1 ∧ B = b1, which contains all the tuples that have value a1 for attribute A and value b1 for attribute B. The differences between the two approaches lie in the space of contexts they are interested in and in how outliers are defined within a certain context. Algorithm 2.4 gives the details of the HOT algorithm as a concrete example. First, contexts are enumerated and frequent contexts are mined using the Apriori algorithm [Agrawal and Srikant 1994]. For each frequent context C, the histogram of frequencies associated with each attribute A not in C is stored, which is then used to compute the deviation of each value taken by A in the database. Finally, the objects whose value on attribute A has a deviation smaller than a user-provided threshold are returned as outliers. Example 2.12 shows an example contextual outlier detected by Algorithm 2.4.
Algorithm 2.4 The HOT algorithm for discovering contextual outliers
Example 2.12 Consider Table 2.2 with four attributes, Name, AgeRange, CarType, and SalaryLevel, and ten tuples. An example frequent context is C : AgeRange = ‘Young′, which contains five tuples, namely, t3, t5, t6, t8, and t10. The attribute CarType is not in C, and is therefore a potential behavioral attribute. There are two values for CarType in the five tuples: the value “Sedan” has a frequency of 1, and the value “Sports” has a frequency of 4. Therefore, the value “Sedan” in tuple t3 is considered an outlying value in the context C : AgeRange = ‘Young′.
Table 2.2 An example table for contextual outlier detection with categorical attributes
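A heavily simplified sketch of the two-step structure of HOT is shown below; it is our own illustration, not the published algorithm. It enumerates contexts by grouping rows on small sets of attributes (rather than using the Apriori lattice), keeps only contexts with a minimum support, and flags values of the remaining attributes whose relative frequency within the context falls below a threshold, which stands in for the more refined deviation measure of Wei et al. [2003].

from itertools import combinations
from collections import Counter

def hot_sketch(rows, attrs, min_support=3, max_context_size=2, dev_threshold=0.25):
    # rows: list of dicts mapping attribute name -> categorical value.
    outliers = []
    for size in range(1, max_context_size + 1):
        for ctx_attrs in combinations(attrs, size):
            groups = {}
            for r in rows:                       # group rows by their context values
                groups.setdefault(tuple(r[a] for a in ctx_attrs), []).append(r)
            for key, members in groups.items():
                if len(members) < min_support:   # keep frequent contexts only
                    continue
                context = dict(zip(ctx_attrs, key))
                for a in attrs:
                    if a in ctx_attrs:
                        continue
                    freqs = Counter(r[a] for r in members)
                    for value, count in freqs.items():
                        if count / len(members) < dev_threshold:
                            outliers.append((context, a, value))
    return outliers

# On the data of Table 2.2, this would report ({'AgeRange': 'Young'}, 'CarType', 'Sedan'),
# since only 1 of the 5 young customers drives a sedan.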
Despite recent advances in contextual outlier detection, it still faces many challenges. (1) Most current contextual outlier detection techniques assume the attributes in the data are either entirely categorical [Wei et al. 2003, Angiulli et al. 2009, Tang et al. 2013] or entirely numerical [Song et al. 2007, Liang and Parthasarathy 2016]. It remains an open question how to handle mixed attribute types in contextual outlier detection. (2) Context enumeration is usually expensive, as the number of contexts is exponential with respect to the number of environmental attributes. Current techniques treat context enumeration and outlier detection within each context as two separate processes with little interaction [Wei et al. 2003, Tang et al. 2013]. One future direction is to interleave the two processes so that contexts that cannot contain outliers are pruned without actually running an outlier detection method on them. (3) A data point might be considered a contextual outlier under different choices of environmental and behavioral attributes. In most scenarios, users must investigate the outliers reported by an automatic technique. Therefore, we need techniques to filter, rank, and report the discovered contextual outliers so that the most interesting outliers are examined first.
2.6 Conclusion
Outlier detection is a rich topic that has been studied within several communities and application domains, and there exist several surveys and books on this topic [Barnett and Lewis 1994, Hellerstein 2008, Chandola et al. 2009, Aggarwal 2013]. This chapter is complementary to those surveys. Our goal is: (1) to provide a classification of major types of outlier detection techniques, namely, statistics-based methods, distance-based methods, and model-based methods; and (2) to discuss outlier detection in high dimensional data.
Statistics-based outlier detection techniques assume that normal data points would appear in high probability regions of a stochastic model, while outliers would occur in low probability regions of a stochastic model. We distinguish between two different types of statistics-based approaches: the hypothesis testing approach and the distribution fitting approach. The hypothesis testing approach usually calculates a test statistic based on observed data points to determine whether the null hypothesis (there is no outlier in the dataset) should be rejected. The distribution fitting approach, as the name suggests, fits a pdf to the observed data and marks as outliers those points with low probability according to the pdf. Distance-based outlier detection techniques are based on a definition of distances between data points. In general, normal points should be close to each other, and outlying points should be distant from normal points. Distance-based methods can be further divided into global methods and local methods depending on the reference population used when determining whether a point is an outlier. Model-based approaches follow the classic ML paradigm to learn one or more classifiers to distinguish between normal points and outliers. Model-based approaches are dependent on the availability of labeled training data. We further discuss techniques for handling high dimensional data, including dimensionality reduction, subspace outlier detection, and contextual outlier detection.
As discussed, outlier detection techniques define “normal” differently and also make different assumptions about the underlying dataset. It is important for practitioners to choose the outlier detection methods that best match their use case. If probabilistic interpretations of results are needed, then statistics-based techniques provide a formal framework for statistical reasoning. If the underlying data follows a known distribution, such as the normal distribution, then techniques such as Grubbs’ test or the z-score are applicable, as shown in Section 2.2.2; otherwise, KDE, as shown in Section 2.2.4, provides a nonparametric way of estimating the probability distribution, which can then be used for outlier detection. If probabilistic interpretations of results are not needed, or the underlying distribution is hard to determine, then distance-based (cf. Section 2.3) and model-based techniques (cf. Section 2.4) provide alternative approaches. Distance-based approaches usually require the user to specify a distance parameter or threshold, which may involve a trial-and-error process. Model-based approaches usually require labeled training data to train supervised ML models. Unfortunately, there is no gold standard or dominant outlier detection method. Our taxonomy provides a good reference for navigating the vast literature on outlier detection.