2.1.3 Decision Tree: Regression Trees Versus Classification Trees
The decision tree (Figure 2.3) is a type of supervised learning algorithm (having a predefined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub‐populations) based on the most significant splitter/differentiator in the input variables.
The type of decision tree is determined by the type of target variable used. If a categorical target variable (zero/one or yes/no), as described in the previous section, is used, then we have a categorical variable decision tree. If a continuous target variable is used, then we have a continuous variable decision tree. The basic tree terminology is presented in Figure 2.4.
Figure 2.3 Decision tree.
Figure 2.4 Tree terminology.
Regression trees versus classification trees: From Figure 2.4 we can see that the terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, with the leaves at the bottom and the root at the top.
Both types of trees work in a similar manner. Let us look at the primary differences and similarities between classification and regression trees:
Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
In the case of regression trees, the value obtained by terminal nodes in the training data is the mean response of observations falling in that region. Thus, if an unseen data observation falls in that region, we will make its prediction with the mean value.
In the case of classification trees, the value (class) obtained by terminal nodes in the training data is the mode of observations falling in that region. Thus, if an unseen data observation falls in that region, we will make its prediction with the mode value.
Both the trees divide the predictor space (independent variables) into distinct and non‐overlapping regions. For the sake of simplicity, we can think of these regions as high‐dimensional boxes.
Both types of trees follow a top‐down greedy approach known as recursive binary splitting. We call it “top‐down” because it begins at the top of the tree, when all the observations lie in a single region, and successively splits the predictor space into two new branches further down the tree. It is known as “greedy” because the algorithm considers only the best split available at the current node and does not look ahead to future splits that might lead to a better tree (a minimal sketch of this procedure is given after this list).
This splitting process is continued until a user‐defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.
In both cases, the splitting process grows the tree fully until the stopping criterion is reached. However, a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This is handled by “pruning,” which is one of the techniques used to tackle overfitting.
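To make the recursive binary splitting procedure concrete, the following minimal sketch (the function and variable names are illustrative and not tied to any library) grows a regression tree by greedily choosing, at each node, the feature and threshold that minimize the sum of squared errors of the two children, stopping once a node holds fewer than a user‐defined number of observations (50, as in the stopping‐criterion example above) and storing the mean response at each leaf. For a classification tree, the leaf would instead hold the mode of the labels, and an impurity measure such as Gini would replace the squared error.

```python
import numpy as np

def best_split(X, y):
    """Greedy search over every feature and threshold for the split that
    minimizes the total sum of squared errors (SSE) of the two children."""
    best = None  # (sse, feature_index, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best

def grow_tree(X, y, min_samples=50):
    """Recursive binary splitting, top-down: split until a node holds fewer
    than min_samples observations, then store the mean response as a leaf."""
    split = best_split(X, y) if len(y) >= min_samples else None
    if split is None:                              # stopping criterion reached
        return {"leaf": True, "value": y.mean()}
    _, j, t = split
    mask = X[:, j] <= t
    return {"leaf": False, "feature": j, "threshold": t,
            "left": grow_tree(X[mask], y[mask], min_samples),
            "right": grow_tree(X[~mask], y[~mask], min_samples)}

def predict_one(tree, x):
    """Route one observation down the tree and return the leaf value."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```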
Tree pruning: One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and poorly generalizing to new samples. A small tree might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop because it is impossible to tell if the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, and then use pruning to remove nodes that do not provide additional information.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross‐validation set. There are many techniques for tree pruning that differ in the measurement that is used to optimize performance. Pruning can occur in a top‐down or bottom‐up fashion. A top‐down pruning will traverse nodes and trim subtrees starting at the root, while a bottom‐up pruning will start at the leaf nodes. We now describe two popular pruning algorithms.
Reduced error pruning: One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, then the change is retained. Although somewhat naive, reduced error pruning has the advantage of simplicity and speed.
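A minimal sketch of reduced error pruning is given below, assuming a classification variant of the dictionary‐based tree sketched earlier in which every internal node additionally records the most popular (majority) class of the training observations that reach it, stored under a hypothetical "majority" key; the helper names are illustrative.

```python
def accuracy(tree, X_val, y_val):
    """Fraction of held-out validation examples the tree predicts correctly."""
    return sum(predict_one(tree, x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def reduced_error_prune(root, node, X_val, y_val):
    """Starting at the leaves (bottom-up), try replacing each internal node with
    a leaf holding its most popular class; keep the change only if validation
    accuracy is not reduced, otherwise restore the subtree."""
    if node["leaf"]:
        return
    reduced_error_prune(root, node["left"], X_val, y_val)
    reduced_error_prune(root, node["right"], X_val, y_val)
    before = accuracy(root, X_val, y_val)
    backup = dict(node)                                   # remember the subtree
    node.clear()
    node.update({"leaf": True, "value": backup["majority"]})
    if accuracy(root, X_val, y_val) < before:             # accuracy dropped: undo
        node.clear()
        node.update(backup)
```

A call such as reduced_error_prune(tree, tree, X_val, y_val) then prunes the tree in place against the validation set.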
Cost‐complexity pruning: Cost‐complexity pruning generates a series of trees T0, T1, …, Tm, where T0 is the initial tree and Tm is the root alone. At step i, the tree is created by removing a subtree from tree i − 1 and replacing it with a leaf node having a value chosen as in the tree building algorithm. The subtree that is removed is chosen as follows: (a) Define the error rate of tree T over dataset S as e(T, S). (b) The subtree t chosen for removal minimizes

$$\frac{e\bigl(\operatorname{prune}(T,t),S\bigr)-e(T,S)}{\bigl|\operatorname{leaves}(T)\bigr|-\bigl|\operatorname{leaves}\bigl(\operatorname{prune}(T,t)\bigr)\bigr|}.$$
The function prune(T,t) defines the tree obtained by pruning the subtrees t from the tree T. Once the series of trees has been created, the best tree is chosen by generalized accuracy as measured by a training set or cross‐validation.
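In practice, this procedure is often delegated to a library. The sketch below (assuming scikit-learn is available; it is one possible implementation of the idea, not the exact procedure above) computes the cost‐complexity pruning path on training data, fits one pruned tree per effective alpha, and keeps the tree with the best cross‐validated accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas of the cost-complexity pruning path (corresponding to T0, ..., Tm).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# One pruned tree per alpha; pick the alpha with the best cross-validated accuracy.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X_train, y_train, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print("chosen alpha:", best_alpha, "test accuracy:", pruned.score(X_test, y_test))
```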
The decision on where to make strategic splits heavily affects a tree’s accuracy. The decision criterion is different for classification and regression trees. The most popular algorithms used in practice are Gini, Chi‐Square, and Reduction in Variance. For details, see Section 2.2.2.
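As a small numerical illustration of two of these criteria (Gini for classification, Reduction in Variance for regression; details in Section 2.2.2), the sketch below evaluates a candidate split; the helper names are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Decrease in weighted Gini impurity achieved by a candidate split."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def variance_reduction(parent, left, right):
    """Decrease in weighted variance of a continuous target (regression trees)."""
    n = len(parent)
    return np.var(parent) - (len(left) / n) * np.var(left) - (len(right) / n) * np.var(right)

# Example: splitting [0,0,0,1,1,1] into two pure halves gives the maximum Gini gain of 0.5.
labels = np.array([0, 0, 0, 1, 1, 1])
print(gini_gain(labels, labels[:3], labels[3:]))   # 0.5
```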
Overfitting in decision trees is one of the key challenges faced while using tree‐based algorithms. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in two ways: setting constraints on tree size and tree pruning.
Setting constraints on tree‐based algorithms can be done through various parameters used to define a tree, such as the following (a short example combining them is given after this list):
Minimum samples for a node split defines the minimum number of samples (or observations) that are required in a node to be considered for splitting. It is used to control overfitting. Higher values prevent a model from learning relations that might be highly specific to the particular sample selected for a tree. Excessively high values can lead to underfitting.
Minimum samples for a terminal node (leaf) defines the minimum samples (or observations) required in a terminal node or leaf. It is used to control overfitting in a way that is similar to min_samples_split [51]. In general, lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in a majority will be very small.
Maximum depth of tree (vertical depth) is used to control overfitting as a higher depth will allow the model to learn relations that are very specific to a particular sample.
Maximum number of terminal nodes can be defined in place of max_depth [52]. Since binary trees are created, a depth of n would produce a maximum of 2^n leaves.
Maximum features to consider for split is the number of features to consider while searching for the best split. These will be randomly selected. As a rule of thumb, the square root of the total number of features works well, but values of up to 30–40% of the total number of features are worth checking. Higher values can lead to overfitting, but this is case specific.
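Since the constraints above are named after their scikit-learn counterparts cited in [51, 52], a hedged example of setting them together might look as follows; the specific values are illustrative starting points rather than recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Each constraint below limits tree growth and therefore helps control overfitting.
tree = DecisionTreeClassifier(
    min_samples_split=50,   # minimum samples for a node to be considered for splitting
    min_samples_leaf=20,    # minimum samples required in a terminal node (leaf)
    max_depth=6,            # maximum (vertical) depth of the tree
    max_leaf_nodes=64,      # maximum number of terminal nodes (at most 2^depth for a binary tree)
    max_features="sqrt",    # features considered per split (square-root rule of thumb)
    random_state=0,
)
# tree.fit(X_train, y_train)   # X_train, y_train assumed to be provided elsewhere
```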