Bioinformatics and Medical Applications - Group of authors - Page 21

1.3.3 Decision Tree

Decision trees are powerful, well-known tools used for classification and prediction. A decision tree is a tree-based classifier in which each internal node represents a test on one attribute, each leaf gives the value of the target attribute, each edge represents one outcome of that test, and each root-to-leaf path is a conjunction of tests that forms the final decision.

The current implementation offers two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance). Gini impurity is the probability of misclassifying a new sample if it were labeled at random according to the distribution of class labels in the dataset. Its lower bound of 0 is reached when the data contain only one class. The Gini index is defined by the formula

$$\mathrm{Gini} = 1 - \sum_{j=1}^{c} p_j^{2}$$

Entropy is defined as

$$\mathrm{Entropy} = -\sum_{j=1}^{c} p_j \log_2 p_j$$

where p_j is the proportion of samples at a given node that belong to class j, and c is the number of classes.
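
As a rough illustration of the two impurity measures, the following minimal Python sketch computes them from the class labels at a single node. The function names and the toy labels are assumptions made for this example and do not come from any particular library; the base-2 logarithm is also an assumption, chosen so that entropy is expressed in bits.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the proportion p_j of each class among the labels at a node.

    Assumes a non-empty list of labels.
    """
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p_j * log2(p_j) over the classes present."""
    return sum(-p * log2(p) for p in class_proportions(labels))


# Hypothetical toy nodes: one pure, one evenly mixed.
pure = ["sick"] * 4
mixed = ["sick", "sick", "healthy", "healthy"]

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

For the pure node both measures are 0, the lower bound mentioned above. For the evenly mixed two-class node, the Gini impurity is 0.5 and the entropy is 1 bit, the largest values either measure can take with two classes.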

Gini impurity and entropy are used as split-selection criteria for decision trees. They help determine a good split point at the root and decision nodes of a classification tree (regression trees use variance instead). At each node, the tree splits on the feature and split point that yield the highest information gain (IG) under the chosen criterion, Gini or entropy. Information gain is the decrease in impurity (for the entropy criterion, the decrease in entropy) after a dataset is split on an attribute. Some of the benefits of decision trees are as follows:

 • It requires less effort for data preparation during preprocessing.

 • It does not require standardization or scaling of the data.

 • It is intuitive and easy to explain.

However, it has some disadvantages too, as follows:

 • Minor changes in the data can cause major changes in the tree structure, leading to instability.

 • Its calculations can sometimes be more complex than those of other algorithms.

 • It usually takes more time to train.

 • Training can be computationally expensive because of the complexity and time involved.

 • It is not well suited to regression and predicting continuous values.
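
To make the information-gain criterion described earlier in this section more concrete, here is a minimal Python sketch that scores candidate split points on a single numeric feature. It is an illustration under stated assumptions, not the procedure of any specific library: the helper names (`information_gain`, `best_threshold`), the choice of midpoints between distinct values as candidate thresholds, and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of the class labels at a node (assumes a non-empty list)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def best_threshold(values, labels, impurity=entropy):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        gain = information_gain(labels, left, right, impurity)
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: one numeric feature and a binary label.
ages = [25, 32, 47, 51, 62, 68]
labels = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]

threshold, gain = best_threshold(ages, labels)
print(threshold, gain)
```

On this toy data the best split falls at 49.0, midway between 47 and 51, where both children are pure and the gain equals the full parent entropy of 1 bit. A Gini-based function could be passed through the `impurity` parameter in place of `entropy` to score the same splits under the other criterion.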
