Читать книгу Applied Modeling Techniques and Data Analysis 2 - Группа авторов - Страница 14

1.2. Materials and methods 1.2.1. Data

Оглавление

Data on hand refers to a sample of 8,028 audited self-employed individuals for fiscal year 2012, each described by a set of features, concerning, among others, their tax returns, their properties and their tax notice.3

Just for descriptive purposes, we can depict the statistical distribution of the revenues achieved by the businesses in our sample, grouped in classes (in thousands of euros), in Figure 1.1.

Most of our dataset is made up of small-sized taxpayers, of which almost 50% show revenues lower than € 75,000 per year and only 4% higher than € 500,000, with a sample average of € 146,348.


Figure 1.1. Revenues distribution

For each taxpayer in the dataset, both his tax notice status and the additional due taxes (i.e. the additional requested tax amount) are known.

Here comes the first problem that needs to be tackled: the additional due tax is a numeric attribute which measures the seriousness of the taxpayer’s tax evasion, whereas our algorithms, as we will show later on, need categorical values in order to predict. Thus, we cannot directly use the additional due taxes, but we need to define a class variable and decide both which values it will take and how to map each numeric value referred to the additional due taxes into such categorical values.

Applied Modeling Techniques and Data Analysis 2

Подняться наверх