Profit Driven Business Analytics - Baesens Bart
CHAPTER 2
Analytical Techniques
DATA PREPROCESSING
Data are the key ingredient for any analytical exercise. Hence, it is important to thoroughly consider and gather all data sources that are potentially of interest and relevant before starting the analysis. Large experiments as well as a broad experience in different fields indicate that when it comes to data, bigger is better. However, real-life data can be (and typically are) dirty because of inconsistencies, incompleteness, duplication, merging, and many other problems. Hence, throughout the analytical modeling steps, various data preprocessing checks are applied to clean up and reduce the data to a manageable and relevant size. Worth mentioning here is the garbage in, garbage out (GIGO) principle, which essentially states that messy data will yield messy analytical models. Hence, it is of utmost importance that every data preprocessing step is carefully justified, carried out, validated, and documented before proceeding with further analysis. Even the slightest mistake can make the data totally unusable for further analysis, and completely invalidate the results. In what follows, we briefly zoom into some of the most important data preprocessing activities.
Denormalizing Data for Analysis
The application of analytics typically requires or presumes the data to be presented in a single table, containing and representing all the data in some structured way. A structured data table allows straightforward processing and analysis, as briefly discussed in Chapter 1. Typically, the rows of a data table represent the basic entities to which the analysis applies (e.g., customers, transactions, firms, claims, or cases). The rows are also referred to as observations, instances, records, or lines. The columns in the data table contain information about the basic entities. Plenty of synonyms are used to denote the columns of the data table, such as (explanatory or predictor) variables, inputs, fields, characteristics, attributes, indicators, and features, among others. In this book, we will consistently use the terms observation and variable.
Several normalized source data tables have to be merged in order to construct the aggregated, denormalized data table. Merging tables involves selecting information from different tables related to an individual entity, and copying it to the aggregated data table. The individual entity can be recognized and selected in the different tables by making use of (primary) keys, which are attributes that have specifically been included in the table to allow identifying and relating observations from different source tables pertaining to the same entity. Figure 2.1 illustrates the process of merging two tables – that is, transaction data and customer data – into a single, non-normalized data table by making use of the key attribute ID, which allows connecting observations in the transactions table with observations in the customer table. The same approach can be followed to merge as many tables as required, but clearly the more tables are merged, the more duplicate data might be included in the resulting table. It is crucial that no errors are introduced during this process, so some checks should be applied to control the resulting table and to make sure that all information is correctly integrated.
Figure 2.1 Aggregating normalized data tables into a non-normalized data table.
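As a minimal sketch of the merge described above, the following Python snippet copies customer attributes onto each transaction by means of the key attribute ID and then applies a basic check to the result. The table contents and field names (`id`, `age`, `amount`, and so on) are hypothetical and only serve as an illustration; they are not the data of Figure 2.1.

```python
# Hypothetical normalized source tables; all names and values are illustrative.
customers = {
    "C1": {"age": 31, "start_date": "2015-01-01"},
    "C2": {"age": 49, "start_date": "2010-06-15"},
}
transactions = [
    {"id": "C1", "date": "2016-03-02", "amount": 120.0},
    {"id": "C1", "date": "2016-03-09", "amount": 35.5},
    {"id": "C2", "date": "2016-03-04", "amount": 900.0},
    {"id": "C9", "date": "2016-03-05", "amount": 10.0},  # key without a customer
]


def denormalize(transactions, customers):
    """Copy the customer attributes onto each transaction via the key `id`."""
    merged, unmatched = [], []
    for tx in transactions:
        customer = customers.get(tx["id"])
        if customer is None:
            # Report keys that cannot be matched instead of silently dropping them.
            unmatched.append(tx)
            continue
        merged.append({**tx, **customer})
    return merged, unmatched


merged, unmatched = denormalize(transactions, customers)

# Basic integrity check: every source transaction is accounted for exactly once.
assert len(merged) + len(unmatched) == len(transactions)
print(len(merged), "merged rows;", len(unmatched), "unmatched")
```

Note that the customer attributes of C1 appear twice in the merged table, which illustrates the duplication that denormalization introduces.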
Sampling
The aim of sampling is to take a subset of historical data (e.g., past transactions), and use that to build an analytical model. A first obvious question that comes to mind concerns the need for sampling. Obviously, with the availability of high performance computing facilities (e.g., grid and cloud computing), one could also try to directly analyze the full dataset. However, a key requirement for a good sample is that it should be representative of the future entities on which the analytical model will be run. Hence, the timing aspect becomes important since, for instance, transactions of today are more similar to transactions of tomorrow than they are to transactions of yesterday. Choosing the optimal time window of the sample involves a trade-off between lots of data (and hence a more robust analytical model) and recent data (which may be more representative). The sample should also be taken from an average business period to get as accurate a picture as possible of the target population.
Exploratory Analysis
Exploratory analysis is a very important part of getting to know your data in an “informal” way. It allows gaining some initial insights into the data, which can then be usefully adopted throughout the analytical modeling stage. Different plots/graphs can be useful here, such as bar charts, pie charts, and scatter plots. A next step is to summarize the data by using some descriptive statistics, each of which provides information with respect to a particular characteristic of the data. Hence, they should be assessed together (i.e., in support and completion of each other). Basic descriptive statistics are the mean and median values of continuous variables, with the median value being less sensitive to extreme values but, as well, not providing as much information with respect to the full distribution. Complementary to the mean value, the variance or the standard deviation provides insight with respect to how much the data are spread around the mean value. Likewise, percentile values such as the 10th, 25th, 75th, and 90th percentile provide further information with respect to the distribution, as a complement to the median value. For categorical variables, other measures need to be considered, such as the mode or most frequently occurring value.
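The descriptive statistics mentioned above can be computed with Python's standard `statistics` module, as in the sketch below. The `income` and `segment` values are made up for illustration; the last income is deliberately extreme to show how it pulls the mean away from the median.

```python
import statistics

# Illustrative values only; the last income is an extreme observation.
income = [1800, 2100, 2300, 2500, 2700, 3000, 3200, 3600, 4100, 25000]
segment = ["retail", "retail", "sme", "retail", "corporate", "sme", "retail"]

mean = statistics.mean(income)      # 5030, pulled upward by the extreme value
median = statistics.median(income)  # 2850, hardly affected by it
stdev = statistics.stdev(income)    # sample standard deviation

# quantiles(n=20) returns the 19 cut points at 5%, 10%, ..., 95%,
# so index i corresponds to the (i + 1) * 5th percentile.
cuts = statistics.quantiles(income, n=20, method="inclusive")
p10, p25, p75, p90 = cuts[1], cuts[4], cuts[14], cuts[17]

# For a categorical variable, the mode is the most frequently occurring value.
mode = statistics.mode(segment)

print(f"mean={mean}, median={median}, stdev={stdev:.1f}")
print(f"p10={p10}, p25={p25}, p75={p75}, p90={p90}, mode={mode}")
```

Assessing the mean and the median together, as recommended above, immediately signals the skew in this small sample.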
Missing Values
Missing values (see Table 2.1) can occur for various reasons. The information can be nonapplicable – for example, when modeling the amount of fraud, this information is only available for the fraudulent accounts and not for the nonfraudulent accounts since it is not applicable there (Baesens et al. 2015). The information can also be undisclosed. For example, a customer decided not to disclose his or her income because of privacy. Missing data can also originate because of an error during merging (e.g., typos in name or ID). Missing values can be very meaningful from an analytical perspective since they may indicate a particular pattern. As an example, a missing value for income could imply unemployment, which may be related to, for example, default or churn. Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other techniques need some additional preprocessing. Popular missing value handling schemes are removal of the observation or variable, and replacement (e.g., by the mean/median for continuous variables and by the mode for categorical variables).
Table 2.1 Missing Values in a Dataset
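The two handling schemes mentioned above, removal and replacement, can be sketched as follows. The records and field names are hypothetical and do not reproduce Table 2.1. The extra `*_missing` indicator is our own addition, motivated by the remark that missingness itself may carry a pattern.

```python
import statistics

# Illustrative records; None marks a missing value. Field names are hypothetical.
rows = [
    {"id": 1, "age": 34, "income": 1800, "marital": "single"},
    {"id": 2, "age": 28, "income": None, "marital": "married"},
    {"id": 3, "age": None, "income": 2200, "marital": None},
    {"id": 4, "age": 44, "income": 2600, "marital": "married"},
]


def drop_incomplete(rows):
    """Removal scheme: keep only observations without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(rows, continuous, categorical):
    """Replacement scheme: median for continuous, mode for categorical variables."""
    fill = {}
    for col in continuous:
        fill[col] = statistics.median([r[col] for r in rows if r[col] is not None])
    for col in categorical:
        fill[col] = statistics.mode([r[col] for r in rows if r[col] is not None])
    result = []
    for r in rows:
        new = dict(r)
        for col, value in fill.items():
            # Keep an indicator, since the fact that a value is missing
            # may itself be informative (assumption of this sketch).
            new[col + "_missing"] = r[col] is None
            if r[col] is None:
                new[col] = value
        result.append(new)
    return result


complete = drop_incomplete(rows)
imputed = impute(rows, continuous=["age", "income"], categorical=["marital"])
print(len(complete), "complete rows;", len(imputed), "rows after imputation")
```

Removal discards half of this small dataset, whereas replacement keeps all observations at the cost of introducing estimated values.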
Outlier Detection and Handling
Outliers are extreme observations that are very dissimilar to the rest of the population. Two types of outliers can be considered: valid observations (e.g., the salary of the boss is €1,000,000) and invalid observations (e.g., age is 300 years). Two important steps in dealing with outliers are detection and treatment. A first obvious check for outliers is to calculate the minimum and maximum values for each of the data elements. Various graphical tools can also be used to detect outliers, such as histograms, box plots, and scatter plots. Some analytical techniques (e.g., decision trees) are fairly robust with respect to outliers. Others (e.g., linear/logistic regression) are more sensitive to them. Various schemes exist to deal with outliers; these are highly dependent on whether the outlier represents a valid or an invalid observation. For invalid observations (e.g., age is 300 years), one could treat the outlier as a missing value by using any of the schemes (i.e., removal or replacement) mentioned in the previous section. For valid observations (e.g., a salary of €1,000,000), other schemes are needed, such as capping, whereby lower and upper limits are defined for each data element and values beyond a limit are set to that limit.
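The detection and treatment steps above can be sketched as follows. The admissible age range and the capping limits are assumptions chosen for illustration; in practice such limits are set per data element, for instance from domain knowledge or from percentiles of the data.

```python
def cap(values, lower, upper):
    """Capping: pull values outside [lower, upper] back to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


ages = [23, 35, 41, 300, 52]                     # 300 is an invalid observation
incomes = [1800, 2100, 2500, 3000, 1_000_000]    # the last value is valid but extreme

# Detection: a first check is the minimum and maximum of each data element.
print("age range:", min(ages), "-", max(ages))
print("income range:", min(incomes), "-", max(incomes))

# Treatment of invalid observations: recode them as missing (None) so that a
# missing value scheme can be applied. The range 0-120 is an assumption.
AGE_MIN, AGE_MAX = 0, 120
ages_clean = [a if AGE_MIN <= a <= AGE_MAX else None for a in ages]

# Treatment of valid observations: capping with hypothetical limits.
incomes_capped = cap(incomes, lower=1000, upper=10_000)

print(ages_clean)
print(incomes_capped)
```

Capping keeps the extreme observation in the dataset while limiting its influence on techniques that are sensitive to outliers.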
Principal Component Analysis
A popular technique for reducing dimensionality, studying linear correlations, and visualizing complex datasets is principal component analysis (PCA). This technique has been known since the beginning of the last century (Jolliffe 2002), and it is based on the concept of constructing an uncorrelated, orthogonal basis of the original dataset.
Throughout this section, we will assume that the observation matrix X is normalized to zero mean, so that $\operatorname{mean}(X) = 0$. We do this so that the covariance matrix of X is equal to $X^T X$, up to the constant factor $1/(n-1)$ for n observations. In case the matrix is not normalized, the only consequence is that the calculations have an extra (constant) term, so assuming a centered dataset will simplify the analyses.
The idea for PCA is simple: is it possible to engulf our data in an ellipsoid? If so, what would that ellipsoid look like? We would like four properties to hold:
1. Each principal component should capture as much variance as possible.
2. The variance that each principal component captures should decrease in each step.
3. The transformation should respect the distances between the observations and the angles that they form (i.e., should be orthogonal).
4. The coordinates should not be correlated with each other.
The answer to these questions lies in the eigenvectors and eigenvalues of the data matrix. The orthogonal basis of a matrix is the set of eigenvectors (coordinates) such that each one is orthogonal to all the others, or, from a statistical point of view, uncorrelated with them. The order of the components comes from a property of the covariance matrix $X^T X$: if the eigenvectors are ordered by the eigenvalues of $X^T X$, then the highest eigenvalue will be associated with the coordinate that represents the most variance. Another interesting property of the eigenvalues and the eigenvectors, proven below, is that the eigenvalues of $X^T X$ are equal to the squares of the eigenvalues of X (strictly speaking, of its singular values, since X is in general not a square matrix), and that the eigenvectors of X and $X^T X$ are the same. This will simplify our analyses, as finding the orthogonal basis of X will be the same as finding the orthogonal basis of $X^T X$.
The principal component transformation of X will then calculate a new matrix P from the eigenvectors of X (or $X^T X$). If V is the matrix with the eigenvectors of X, then the transformation will calculate a new matrix $P = XV$. The question is how to calculate this orthogonal basis in an efficient way.
The singular value decomposition (SVD) of the original dataset X is the most efficient method of obtaining its principal components. The idea of the SVD is to decompose the dataset (matrix) X into a set of three matrices, U, D, and V, such that $X = U D V^T$, where $V^T$ is the transpose of the matrix V (see footnote 1), and U and V are unitary matrices, so $U^T U = V^T V = I$. The matrix D is a diagonal matrix whose elements $d_i$ are the singular values of the matrix X.
Now we can calculate the principal component transformation P of X. If $P = XV$, then $P = U D V^T V = U D$. But, from $X = U D V^T$ we can calculate the expression $X^T X = V D U^T U D V^T = V D^2 V^T$, and identifying terms we can see that the matrix V is composed of the eigenvectors of $X^T X$, which are equal to the eigenvectors of X, and that the eigenvalues of X will be equal to the square root of the eigenvalues of $X^T X$, $D^2$, as we previously stated. Thus, $P = U D$, with D the eigenvalues of X and U the eigenvectors, or left singular vectors, of X.
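The derivation above can be checked numerically. The sketch below uses NumPy (our choice of library, not prescribed by the text) on synthetic data and verifies that the transformation XV equals UD, that the eigenvalues of the matrix product of the transposed X with X are the squared singular values, and that the resulting coordinates are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: 200 observations, 3 variables (illustrative only).
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
X = X - X.mean(axis=0)               # center the data so that mean(X) = 0

# Thin SVD: X = U diag(d) V^T, singular values d in decreasing order.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

P = X @ V                            # principal component transformation

# P = XV = UD (multiplying U by d scales its columns, i.e. U @ diag(d)).
assert np.allclose(P, U * d)

# The eigenvalues of X^T X equal the squared singular values of X.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # eigvalsh sorts ascending
assert np.allclose(eigvals, d ** 2)

# V is orthogonal, and the new coordinates are uncorrelated: P^T P is diagonal.
assert np.allclose(V.T @ V, np.eye(3))
assert np.allclose(P.T @ P, np.diag(d ** 2))

# Share of the total variance captured by each component, in decreasing order.
explained = d ** 2 / np.sum(d ** 2)
print(explained)
```

The decreasing values in `explained` correspond to the first two properties listed earlier, and the orthogonality and diagonality checks correspond to the third and fourth.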
1. In detail, we would like the matrices U and V to be unitary, that is, the inverse of each matrix is its conjugate transpose. The conjugate transpose $A^*$ of a matrix A is such that $(A^*)_{ij} = \bar{a}_{ji}$, with $\bar{a}$ the complex conjugate of a. If all elements $a_{ij}$ of the matrix A are real, then $A^* = A^T$, the transpose of the matrix A.