Design Example 2.4
A slightly more sophisticated tree, for predicting email spam, is shown in Figure 2.17. The results for data from 4601 email messages are presented in [12].
Figure 2.17 Predicting email spam.
Source: Trevor Hastie [12].
Splitting criteria: In most cases, the discrete splitting functions are univariate, meaning that an internal node is split according to the value of a single attribute. Consequently, the inducer searches for the best attribute upon which to split. There are various univariate criteria, which can be characterized in different ways, such as according to the origin of the measure (information theory, dependence, and distance) and according to the measure structure (impurity‐based criteria, normalized impurity‐based criteria, and binary criteria). Next, we describe the most common criteria in the literature.
Figure 2.18 Top‐down algorithmic framework for decision tree induction. The inputs are S (training set), A (input feature set), and y (target feature) [8].
Impurity‐based criteria: Given a random variable with k discrete values, distributed according to P = (p1, p2, …, pk), an impurity measure is a function ϕ: [0, 1]k → R that satisfies the following conditions: ϕ(P) ≥ 0; ϕ(P) is minimum if ∃i such that pi = 1; ϕ(P) is maximum if ∀i, 1 ≤ i ≤ k, pi = 1/k; ϕ(P) is symmetric with respect to the components of P; and ϕ(P) is differentiable everywhere in its range. If the probability vector has a component equal to 1 (the variable x takes only one value), then the variable is defined as pure. On the other hand, if all components are equal, the level of impurity reaches its maximum. Given a training set S, the probability vector of the target attribute y is defined as
$$P_y(S) = \left(\frac{\left|\sigma_{y=c_1}S\right|}{|S|},\ \ldots,\ \frac{\left|\sigma_{y=c_{|\mathrm{dom}(y)|}}S\right|}{|S|}\right) \qquad (2.12)$$
where $\sigma_{y=c_j}S$ denotes the subset of instances in S for which the target attribute takes the value $c_j$.
The goodness of split due to the discrete attribute $a_i$ is defined as the reduction in impurity of the target attribute after partitioning S according to the values $v_{i,j} \in \mathrm{dom}(a_i)$:
$$\Delta\Phi(a_i, S) = \phi\!\left(P_y(S)\right) - \sum_{v_{i,j}\,\in\, \mathrm{dom}(a_i)} \frac{\left|\sigma_{a_i=v_{i,j}}S\right|}{|S|}\,\phi\!\left(P_y\!\left(\sigma_{a_i=v_{i,j}}S\right)\right) \qquad (2.13)$$
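As an illustration, the following minimal Python sketch evaluates Eqs. (2.12) and (2.13) for a discrete attribute. The function names and the plain-list data representation are our own conventions for illustration, not part of the original text.

```python
from collections import Counter

def class_probabilities(y_values):
    """Empirical probability vector P_y(S) of the target attribute, Eq. (2.12)."""
    n = len(y_values)
    counts = Counter(y_values)
    return [counts[c] / n for c in sorted(counts)]

def goodness_of_split(x_values, y_values, impurity):
    """Impurity reduction of Eq. (2.13) for one discrete attribute a_i.

    `impurity` is any impurity function phi over a probability vector,
    for example the entropy or Gini measures defined later in this section.
    """
    n = len(y_values)
    before = impurity(class_probabilities(y_values))
    after = 0.0
    for v in set(x_values):
        subset_y = [y for x, y in zip(x_values, y_values) if x == v]
        after += len(subset_y) / n * impurity(class_probabilities(subset_y))
    return before - after
```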
Information gain (ig) is an impurity‐based criterion that uses the entropy (e) measure (origin from information theory) as the impurity measure:
$$ig(a_i, S) = e(y, S) - \sum_{v_{i,j}\,\in\,\mathrm{dom}(a_i)} \frac{\left|\sigma_{a_i=v_{i,j}}S\right|}{|S|}\, e\!\left(y,\ \sigma_{a_i=v_{i,j}}S\right)$$
where
$$e(y, S) = -\sum_{c_j\,\in\,\mathrm{dom}(y)} \frac{\left|\sigma_{y=c_j}S\right|}{|S|}\,\log_2\frac{\left|\sigma_{y=c_j}S\right|}{|S|} \qquad (2.14)$$
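Continuing the sketch above, the entropy measure of Eq. (2.14) and the resulting information gain can be written as follows (again an illustrative sketch, not the book's code):

```python
import math

def entropy(probs):
    """Entropy e(P) of a probability vector, the impurity measure of Eq. (2.14)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Information gain is the goodness of split with entropy as the impurity measure:
# ig = goodness_of_split(x_values, y_values, entropy)
```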
Gini index: This is an impurity‐based criterion that measures the divergence between the probability distributions of the target attribute’s values. The Gini (G) index is defined as
$$G(y, S) = 1 - \sum_{c_j\,\in\,\mathrm{dom}(y)} \left(\frac{\left|\sigma_{y=c_j}S\right|}{|S|}\right)^{2} \qquad (2.15)$$
Consequently, the evaluation criterion for selecting the attribute ai is defined as the Gini gain (GG):
$$GG(a_i, S) = G(y, S) - \sum_{v_{i,j}\,\in\,\mathrm{dom}(a_i)} \frac{\left|\sigma_{a_i=v_{i,j}}S\right|}{|S|}\, G\!\left(y,\ \sigma_{a_i=v_{i,j}}S\right) \qquad (2.16)$$
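In the same illustrative setting, the Gini index and Gini gain of Eqs. (2.15) and (2.16) become:

```python
def gini(probs):
    """Gini index G(P) = 1 - sum(p_j^2), Eq. (2.15)."""
    return 1.0 - sum(p * p for p in probs)

# The Gini gain of Eq. (2.16) reuses the same goodness-of-split skeleton:
# gg = goodness_of_split(x_values, y_values, gini)
```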
Likelihood ratio chi‐squared statistics: The likelihood ratio (lr) is defined as
$$lr(a_i, S) = 2\,\ln 2\cdot |S|\cdot ig(a_i, S) \qquad (2.17)$$
This ratio is useful for measuring the statistical significance of the information gain criterion. The null hypothesis (H0) is that the input attribute and the target attribute are conditionally independent. If H0 holds, the test statistic is distributed as χ2 with (∣dom(ai)∣ − 1) · (∣dom(y)∣ − 1) degrees of freedom.
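A small sketch of this test, building on the functions above; the use of SciPy for the χ2 tail probability is an assumption on our part:

```python
import math
from scipy.stats import chi2

def likelihood_ratio_test(x_values, y_values):
    """Likelihood-ratio statistic of Eq. (2.17) and its p-value under H0
    (input attribute and target attribute conditionally independent)."""
    ig = goodness_of_split(x_values, y_values, entropy)  # from the sketches above
    lr = 2.0 * math.log(2.0) * len(y_values) * ig
    dof = (len(set(x_values)) - 1) * (len(set(y_values)) - 1)
    return lr, chi2.sf(lr, dof)
```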
Normalized impurity‐based criterion: The impurity‐based criteria described above are biased toward attributes with larger domains; that is, they prefer input attributes with many values over attributes with fewer values. For instance, an input attribute that represents a social security number will probably obtain the highest information gain. However, adding this attribute to a decision tree will result in poor generalization accuracy. For that reason, it is useful to “normalize” the impurity‐based measures, as described in the subsequent paragraphs.
Gain ratio (gr): This ratio “normalizes” the information gain (ig) as follows: gr(ai, S) = ig(ai, S)/e(ai, S). Note that this ratio is not defined when the denominator is zero, and that it may tend to favor attributes for which the denominator is very small. Consequently, it is suggested to carry out the selection in two stages. First, the information gain is calculated for all attributes. Then, considering only attributes that have performed at least as well as the average information gain, the attribute with the best gain ratio is selected. It has been shown that the gain ratio tends to outperform the simple information gain criterion in terms of both accuracy and classifier complexity.
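The two-stage selection described above can be sketched as follows, reusing the earlier functions; the dictionary-of-attributes representation is a hypothetical convenience, not something defined in the text:

```python
def select_attribute_by_gain_ratio(X, y):
    """Pick an attribute by gain ratio, restricted to attributes whose information
    gain is at least the average. `X` maps attribute name -> list of discrete values."""
    gains = {a: goodness_of_split(vals, y, entropy) for a, vals in X.items()}
    avg_gain = sum(gains.values()) / len(gains)

    def gain_ratio(a):
        split_entropy = entropy(class_probabilities(X[a]))  # e(a_i, S)
        return gains[a] / split_entropy if split_entropy > 0 else 0.0

    candidates = [a for a, g in gains.items() if g >= avg_gain]
    return max(candidates, key=gain_ratio)
```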
Distance measure: Similar to the gain ratio, this measure also normalizes the impurity measure. However, the method used is different:
$$\frac{\Delta\Phi(a_i, S)}{-\displaystyle\sum_{v_{i,j}\,\in\,\mathrm{dom}(a_i)}\ \sum_{c_k\,\in\,\mathrm{dom}(y)} \frac{\left|\sigma_{a_i=v_{i,j}\,\wedge\, y=c_k}S\right|}{|S|}\,\log_2\frac{\left|\sigma_{a_i=v_{i,j}\,\wedge\, y=c_k}S\right|}{|S|}} \qquad (2.18)$$
where ΔΦ(ai, S) is the impurity reduction defined in Eq. (2.13).
Binary criteria: These are used for creating binary decision trees. These measures are based on the division of the input attribute domain into two subdomains.
Let β(ai, d1, d2, S) denote the binary criterion value for attribute ai over sample S when d1 and d2 are its corresponding subdomains. The value obtained for the optimal division of the attribute domain into two mutually exclusive and exhaustive subdomains is used for comparing attributes, namely
$$\beta^{*}(a_i, S) = \max_{\substack{d_1\,\cup\, d_2\,=\,\mathrm{dom}(a_i)\\ d_1\,\cap\, d_2\,=\,\varnothing}} \beta(a_i, d_1, d_2, S) \qquad (2.19)$$
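For a small discrete domain, the maximization in Eq. (2.19) can be carried out by exhaustive enumeration of the binary partitions; the sketch below assumes `beta` is any binary criterion with the signature shown (our convention):

```python
from itertools import combinations

def best_binary_split(x_values, y_values, beta):
    """Evaluate a binary criterion over all splits of dom(a_i) into two non-empty,
    mutually exclusive and exhaustive subdomains, as in Eq. (2.19)."""
    domain = sorted(set(x_values))
    best = None
    # Fixing the first domain value in d1 avoids scoring each mirror-image partition twice.
    for r in range(len(domain)):
        for combo in combinations(domain[1:], r):
            d1 = {domain[0], *combo}
            d2 = set(domain) - d1
            if not d2:
                continue
            score = beta(x_values, y_values, d1, d2)
            if best is None or score > best[0]:
                best = (score, d1, d2)
    return best  # (best score, d1, d2)
```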
Twoing criterion: The Gini index may encounter problems when the domain of the target attribute is relatively wide. For this case, the binary criterion called the twoing (tw) criterion has been suggested. It is defined as
$$tw(a_i, d_1, d_2, S) = 0.25\cdot\frac{\left|\sigma_{a_i\in d_1}S\right|}{|S|}\cdot\frac{\left|\sigma_{a_i\in d_2}S\right|}{|S|}\cdot\left(\sum_{c_i\,\in\,\mathrm{dom}(y)}\left|\frac{\left|\sigma_{a_i\in d_1\,\wedge\, y=c_i}S\right|}{\left|\sigma_{a_i\in d_1}S\right|}-\frac{\left|\sigma_{a_i\in d_2\,\wedge\, y=c_i}S\right|}{\left|\sigma_{a_i\in d_2}S\right|}\right|\right)^{2} \qquad (2.20)$$
When the target attribute is binary, the Gini and twoing criteria are equivalent. For multiclass problems, the twoing criterion prefers attributes with evenly divided splits.
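A sketch of the twoing criterion as a `beta` function for the enumeration above, with the same list-based representation assumed as before:

```python
def twoing(x_values, y_values, d1, d2):
    """Twoing criterion of Eq. (2.20) for a candidate binary split (d1, d2)."""
    n = len(y_values)
    s1 = [y for x, y in zip(x_values, y_values) if x in d1]
    s2 = [y for x, y in zip(x_values, y_values) if x in d2]
    if not s1 or not s2:
        return 0.0
    diff = sum(abs(s1.count(c) / len(s1) - s2.count(c) / len(s2)) for c in set(y_values))
    return 0.25 * (len(s1) / n) * (len(s2) / n) * diff ** 2

# Usage with the enumeration sketch: best_binary_split(x_values, y_values, twoing)
```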
Orthogonality (ort) criterion: This binary criterion is defined as
$$ort(a_i, d_1, d_2, S) = 1 - \cos\theta\!\left(P_{y,1}, P_{y,2}\right)$$
where θ(Py,1, Py,2) is the angle between the two probability distribution vectors Py,1 and Py,2 of the target attribute y on the bags $\sigma_{a_i\in d_1}S$ and $\sigma_{a_i\in d_2}S$, respectively. It was shown that this criterion performs better than the information gain and the Gini index for specific problem constellations.
Kolmogorov–Smirnov (K–S) criterion: Assuming a binary target attribute, namely, dom(y) = {c1, c2}, the criterion is defined as
$$ks(a_i, d_1, d_2, S) = \left|\frac{\left|\sigma_{a_i\in d_1\,\wedge\, y=c_1}S\right|}{\left|\sigma_{y=c_1}S\right|}-\frac{\left|\sigma_{a_i\in d_1\,\wedge\, y=c_2}S\right|}{\left|\sigma_{y=c_2}S\right|}\right| \qquad (2.21)$$
It was suggested to extend this measure to handle target attributes with multiple classes and missing data values. The results indicate that the suggested method outperforms the gain ratio criterion.
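Based on the form of Eq. (2.21) given above, the criterion for a binary target could be sketched as follows; the argument list mirrors the other binary criteria, and this is our illustration rather than the original authors' code:

```python
def ks_criterion(x_values, y_values, d1, d2, c1, c2):
    """Kolmogorov-Smirnov splitting criterion of Eq. (2.21) for dom(y) = {c1, c2}.
    Only d1 appears explicitly; d2 is its complement and is kept for a uniform signature."""
    n_c1 = sum(1 for y in y_values if y == c1)
    n_c2 = sum(1 for y in y_values if y == c2)
    in_d1_c1 = sum(1 for x, y in zip(x_values, y_values) if x in d1 and y == c1)
    in_d1_c2 = sum(1 for x, y in zip(x_values, y_values) if x in d1 and y == c2)
    return abs(in_d1_c1 / n_c1 - in_d1_c2 / n_c2)
```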
Stopping criteria: The tree splitting (growing) phase continues until a stopping criterion is triggered. The following conditions are common stopping rules: (i) all instances in the training set belong to a single value of y; (ii) the maximum tree depth has been reached; (iii) the number of cases in the terminal node is less than the minimum number of cases for parent nodes; (iv) if the node were split, the number of cases in one or more child nodes would be less than the minimum number of cases for child nodes; and (v) the best splitting criterion value is not greater than a certain threshold.
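These rules translate into a simple guard that a top-down inducer checks before splitting a node; the parameter names below are ours:

```python
def should_stop(y_values, depth, max_depth, min_parent_size, best_criterion, threshold):
    """Common stopping rules (i)-(iii) and (v) from the list above; rule (iv), which
    constrains the size of prospective child nodes, is usually checked while the
    candidate splits themselves are generated."""
    return (
        len(set(y_values)) <= 1             # (i) all instances share one value of y
        or depth >= max_depth               # (ii) maximum tree depth reached
        or len(y_values) < min_parent_size  # (iii) too few cases to act as a parent node
        or best_criterion <= threshold      # (v) best splitting criterion below threshold
    )
```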
Pruning methods: Employing tight stopping criteria tends to create small and underfitted decision trees. On the other hand, using loose stopping criteria tends to generate large decision trees that are overfitted to the training set. Pruning methods were developed to resolve this dilemma. There are various techniques for pruning decision trees; most of them perform a top‐down or bottom‐up traversal of the nodes, and a node is pruned if this operation improves a certain criterion. Next, we describe the most popular techniques.
Cost‐complexity pruning (pr): This proceeds in two stages. In the first stage, a sequence of trees T0, T1, … Tk is built on the training data, where T0 is the original tree before pruning and Tk is the root tree. In the second stage, one of these trees is chosen as the pruned tree, based on its generalization error estimation. The tree Ti + 1 is obtained by replacing one or more of the subtrees in the predecessor tree Ti with suitable leaves. The subtrees that are pruned are those that obtain the lowest increase in apparent error rate per pruned leaf (l):
$$\alpha = \frac{\varepsilon\!\left(pr(T,t),S\right) - \varepsilon(T,S)}{\left|l(T)\right| - \left|l\!\left(pr(T,t)\right)\right|} \qquad (2.22)$$
where ε(T, S) indicates the error rate of the tree T over the sample S, ∣l(T)∣ denotes the number of leaves in T, and pr(T, t) denotes the tree obtained by replacing the node t in T with a suitable leaf. In the second phase, the generalization error of each pruned tree T0, T1, …, Tk is estimated, and the best pruned tree is then selected. If the given dataset is large enough, it is suggested to break it into a training set and a pruning set; the trees are constructed using the training set and evaluated on the pruning set. If the given dataset is not large enough, the cross‐validation methodology is suggested, despite the computational complexity implications.
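The quantity in Eq. (2.22) is straightforward to compute once the error rates and leaf counts are known; a minimal sketch:

```python
def cost_complexity_alpha(err_pruned, err_full, leaves_full, leaves_pruned):
    """Increase in apparent error rate per pruned leaf, Eq. (2.22).

    err_pruned    : epsilon(pr(T, t), S), error rate after replacing node t by a leaf
    err_full      : epsilon(T, S), error rate of the current tree
    leaves_full   : |l(T)|
    leaves_pruned : |l(pr(T, t))|
    """
    return (err_pruned - err_full) / (leaves_full - leaves_pruned)

# At each step, the subtree yielding the smallest alpha is pruned, producing T_{i+1} from T_i.
```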
Reduced error pruning: While traversing the internal nodes from the bottom to the top, the procedure checks, for each internal node, whether replacing it with the most frequent class reduces the tree’s accuracy. If not, the node is pruned. The procedure continues until any further pruning would decrease the accuracy. It can be shown that this procedure ends with the smallest version of the most accurate subtree with respect to a given pruning set.
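A compact sketch of the procedure over a hypothetical tree structure; the node fields `children`, `majority_class`, `predict`, and `instances` (the pruning-set examples reaching the node) are our own assumptions, not part of the text:

```python
def reduced_error_prune(node):
    """Bottom-up reduced error pruning.

    Replacing node t by a leaf only changes predictions for instances that reach t,
    so comparing errors on `node.instances` is equivalent to comparing whole-tree accuracy.
    """
    for child in node.children:
        reduced_error_prune(child)
    if not node.children or not node.instances:
        return
    errors_subtree = sum(node.predict(x) != y for x, y in node.instances)
    errors_leaf = sum(node.majority_class != y for x, y in node.instances)
    if errors_leaf <= errors_subtree:   # pruning does not reduce accuracy
        node.children = []              # replace the internal node with a leaf
```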
Minimum‐error pruning (MEP): This performs a bottom‐up traversal of the internal nodes. In each node, it compares the l‐probability‐error rate estimation with and without pruning. The l‐probability‐error rate estimation is a correction to the simple probability estimation using frequencies. If St denotes the instances that have reached node t, then the error rate obtained if this node were pruned is
$$\varepsilon'(t) = 1 - \max_{c_i\,\in\,\mathrm{dom}(y)} \frac{\left|\sigma_{y=c_i}S_t\right| + l\cdot p_{apr}(y=c_i)}{\left|S_t\right| + l} \qquad (2.23)$$
where papr(y = ci) is the a priori probability of y taking the value ci, and l denotes the weight given to the a priori probability. A node is pruned if pruning does not increase the l‐probability‐error rate.
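Following the corrected estimate of Eq. (2.23), the per-node computation can be sketched as follows; the dictionary arguments are our convention:

```python
def l_probability_error(class_counts, priors, l_weight):
    """Error rate estimate of Eq. (2.23) at a node if it were pruned to a leaf.

    class_counts : dict class -> number of node instances of that class
    priors       : dict class -> a priori probability p_apr(y = c_i)
    l_weight     : weight l given to the a priori probabilities
    """
    n = sum(class_counts.values())
    best = max((class_counts[c] + l_weight * priors[c]) / (n + l_weight)
               for c in class_counts)
    return 1.0 - best
```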
Pessimistic pruning: The basic idea is that the error rate estimated using the training set is not reliable enough. Instead, a more realistic measure, known as the continuity correction for the binomial distribution, should be used: ε′(T, S) = ε(T, S) + ∣l(T)∣/(2 · ∣S∣). However, this correction still produces an optimistic error rate. Consequently, it was suggested to prune an internal node t if its error rate is within one standard error of a reference tree, namely
$$\varepsilon'\!\left(pr(T,t),S\right) \le \varepsilon'(T,S) + \sqrt{\frac{\varepsilon'(T,S)\cdot\left(1-\varepsilon'(T,S)\right)}{|S|}} \qquad (2.24)$$
This condition is based on the statistical confidence interval for proportions. It is usually applied such that T refers to the subtree whose root is the internal node t, and S denotes the portion of the training set that reaches that node. The pessimistic pruning procedure performs a top‐down traversal over the internal nodes. If an internal node is pruned, then all its descendants are removed from the pruning process, resulting in a relatively fast pruning.
Error‐based pruning (ebp): This is an evolution of pessimistic pruning. As in pessimistic pruning‚ the error rate is estimated using the upper bound of the statistical confidence interval for proportions:
$$\varepsilon_{UB}(T,S) = \varepsilon(T,S) + Z_{\alpha}\cdot\sqrt{\frac{\varepsilon(T,S)\cdot\left(1-\varepsilon(T,S)\right)}{|S|}} \qquad (2.25)$$
where ε(T, S) denotes the misclassification rate of the tree T on the training set S, Zα is the inverse of the standard normal cumulative distribution function, and α is the desired significance level. Let subtree(T, t) denote the subtree rooted at the node t, let maxchild(T, t) denote the most frequent child node of t (namely, the child reached by most of the instances in St), and let St denote all instances in S that reach the node t. The procedure performs a bottom‐up traversal over all nodes and compares the following values: εUB(subtree(T, t), St), εUB(pr(subtree(T, t), t), St), and εUB(subtree(T, maxchild(T, t)), Smaxchild(T, t)).
According to the lowest value, the procedure either leaves the tree as is, prunes the node, or replaces the node t with the subtree rooted by maxchild (T, t).
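A sketch of the upper-bound computation of Eq. (2.25); the three-way comparison itself only requires evaluating this bound for the three alternatives listed above:

```python
import math

def error_upper_bound(error_rate, n, z_alpha):
    """Upper confidence bound on the error rate, Eq. (2.25); n = |S|."""
    return error_rate + z_alpha * math.sqrt(error_rate * (1.0 - error_rate) / n)

# Error-based pruning keeps the subtree, prunes node t to a leaf, or grafts the subtree
# rooted at maxchild(T, t), whichever of the three yields the lowest bound.
```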
Optimal pruning (opt): Bohanec and Bratko [17] introduced an algorithm guaranteeing optimality, called optimal pruning (opt). This algorithm finds the optimal pruning based on dynamic programming, with a complexity of Θ(∣l(T)∣²), where T is the initial decision tree. Almuallim [18] introduced an improvement of opt, called opt‐2, which also performs optimal pruning using dynamic programming. However, the time and space complexities of opt‐2 are both Θ(∣l(T*)∣ · ∣internal(T)∣), where T* is the target (pruned) decision tree and T is the initial decision tree.
Since the pruned tree is usually much smaller than the initial tree and the number of internal nodes is smaller than the number of leaves, opt‐2 is usually more efficient than opt in terms of computational complexity.
Minimum description length (MDL) pruning: Rissanen [19], Quinlan and Rivest [20], and Mehta et al. [21] used the MDL to evaluate the generalized accuracy of a node. This method measures the size of a decision tree by means of the number of bits required to encode the tree. The MDL method prefers decision trees that can be encoded with fewer bits. Mehta et al. [21] indicate that the cost of a split at a leaf t can be estimated as
$$\mathrm{Cost}(t) = \sum_{c_i\,\in\,\mathrm{dom}(y)}\left|\sigma_{y=c_i}S_t\right|\cdot\ln\frac{\left|S_t\right|}{\left|\sigma_{y=c_i}S_t\right|} + \frac{\left|\mathrm{dom}(y)\right|-1}{2}\,\ln\frac{\left|S_t\right|}{2} + \ln\frac{\pi^{\left|\mathrm{dom}(y)\right|/2}}{\Gamma\!\left(\left|\mathrm{dom}(y)\right|/2\right)} \qquad (2.26)$$
where ∣St∣ denotes the number of instances that have reached node t.
The splitting cost of an internal node is calculated based on the cost aggregation of its children.
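A sketch of the leaf cost in Eq. (2.26) as given above; the dictionary argument and function name are our convention:

```python
import math

def mdl_leaf_cost(class_counts):
    """Encoding cost of a leaf t following Eq. (2.26).
    class_counts maps each class c_i to |sigma_{y=c_i} S_t|, the node instances of that class."""
    n = sum(class_counts.values())
    k = len(class_counts)                      # |dom(y)|
    data_cost = sum(c * math.log(n / c) for c in class_counts.values() if c > 0)
    penalty = (k - 1) / 2.0 * math.log(n / 2.0)
    constant = math.log(math.pi ** (k / 2.0) / math.gamma(k / 2.0))
    return data_cost + penalty + constant
```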