Applied Modeling Techniques and Data Analysis 2
1.2.4. The models
Our selection strategy must balance two competing demands. On the one hand, tax notices must be profitable, i.e. they must address serious tax fraud or tax evasion; on the other hand, tax collectability must be guaranteed in order to justify the tax authorities’ efforts.
To this end, we develop two models, both in the form of classification trees. The first predicts whether a taxpayer is interesting or not; the second predicts the final stage of a tax notice, distinguishing notices that end in an enforced recovery proceeding from those that do not.
The first model’s attributes are drawn from several datasets maintained by the IRA and relate to the taxpayers’ tax returns and their annexes (such as the sector studies), property details, customer and supplier lists, and tax notices, whereas the second model focuses only on a set of features concerning taxpayers’ assets.
In the taxpayer selection process, models that are easier to interpret are preferred to more complex ones. Decision trees typically meet this requirement, so both of our models take that form.
In both cases, rather than relying on a single decision tree, practical and theoretical reasons (Breiman 1996) lead us towards a more sophisticated technique known as bagging (short for bootstrap aggregating), in which many base classifiers (in our case, many trees) are trained on bootstrap samples of the training data and their predictions are combined.
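As a rough illustration of the bagging idea, the following minimal sketch trains several base classifiers on bootstrap samples and combines them by majority vote. It is not the authors' implementation: the base learner here is a one-split decision stump standing in for a full tree, and the toy features, labels, function names and the number of estimators are all assumptions made for the example.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Fit a one-split threshold rule (a simplified stand-in for a full tree)."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue  # not a real split
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:
        # No valid split (all sampled rows identical): predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return lambda row: majority
    _, feature, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label

def bagging_fit(rows, labels, n_estimators=25, seed=0):
    """Train one base classifier per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Aggregate the base classifiers by majority vote."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per taxpayer.
rows = [(10, 1), (12, 2), (11, 1), (30, 8), (32, 9), (31, 7)]
labels = ["not interesting"] * 3 + ["interesting"] * 3

models = bagging_fit(rows, labels)
low_case = bagging_predict(models, (9, 1))
high_case = bagging_predict(models, (33, 9))
```

Because each base classifier sees a different resampled version of the data, the aggregated vote tends to be more stable than any single tree, which is the practical motivation cited above.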
Moreover, a cost matrix is used while building the models. In our context, classifying a taxpayer who is actually not interesting as interesting is a much more serious error than classifying an actually interesting taxpayer as not interesting, because tax offices’ human resources are generally barely sufficient to perform all of the audits they are assigned. Therefore, as long as the taxpayers that offices do audit are interesting, the outcome is acceptable, even if many interesting taxpayers are never selected. In the same way, predicting that a tax notice will not end in a coercive procedure when it actually does is a much more serious error than the opposite misclassification. Different weights are therefore given to the different misclassification errors.
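One common way to apply a cost matrix is to choose, for each case, the class with the lowest expected misclassification cost. The sketch below illustrates that rule only; the text does not say which mechanism was actually used (reweighting the training instances is another option), and the cost values, class probabilities and function names are hypothetical.

```python
# Hypothetical cost matrix, indexed as COST[actual][predicted].
# A false "interesting" is assumed to cost five times a missed interesting taxpayer.
COST = {
    "interesting": {"interesting": 0.0, "not interesting": 1.0},
    "not interesting": {"interesting": 5.0, "not interesting": 0.0},
}

def expected_cost(probabilities, predicted):
    """Expected cost of predicting `predicted`, given estimated class probabilities."""
    return sum(p * COST[actual][predicted] for actual, p in probabilities.items())

def min_cost_class(probabilities):
    """Return the prediction with the lowest expected cost."""
    return min(COST, key=lambda predicted: expected_cost(probabilities, predicted))

# A taxpayer estimated to be interesting with probability 0.7 is still rejected,
# because a wrong "interesting" call is assumed to be far more expensive.
borderline = min_cost_class({"interesting": 0.7, "not interesting": 0.3})

# Only a high estimated probability justifies selecting the taxpayer.
confident = min_cost_class({"interesting": 0.9, "not interesting": 0.1})
```

With these illustrative costs, a taxpayer is labelled interesting only when the estimated probability exceeds 5/6, which reflects the preference for auditing few but reliably interesting taxpayers.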
Finally, Ross Quinlan’s C4.5 decision tree algorithm is used to build the base classifiers within the bagging process.
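C4.5 chooses splits using the gain ratio, i.e. the information gain of an attribute divided by the entropy of the partition it induces. The following sketch computes that criterion for a categorical attribute; it omits the other parts of C4.5 (thresholds on continuous attributes, missing values, pruning), and the attribute values and labels are hypothetical toy data.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one categorical attribute per candidate split.
labels = ["interesting", "interesting", "not interesting", "not interesting"]
informative = ["yes", "yes", "no", "no"]    # separates the classes perfectly
uninformative = ["x", "y", "x", "y"]        # tells us nothing about the class

ratio_informative = gain_ratio(informative, labels)
ratio_uninformative = gain_ratio(uninformative, labels)
```

Normalizing by the split information penalizes attributes that fragment the data into many small groups, which plain information gain would otherwise favor.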
Figure 1.5 puts all the pieces of our models together.
Figure 1.5. The two models together