Читать книгу Data Science For Dummies - Lillian Pierson - Страница 73
Decreasing dimensionality and removing outliers with PCA
ОглавлениеPrincipal component analysis (PCA) is another dimensionality reduction technique that’s closely related to SVD: This unsupervised statistical method finds relationships between features in your dataset and then transforms and reduces them to a set of non-information-redundant principal components — uncorrelated features that embody and explain the information that’s contained within the dataset (that is, its variance). These components act as a synthetic, refined representation of the dataset, with the information redundancy, noise, and outliers stripped out. You can then use those reduced components as input for your machine learning algorithms to make predictions based on a compressed representation of your data. (For more on outliers, see the “Detecting Outliers” section, later in this chapter.)
The PCA model makes these two assumptions:
Multivariate normality (MVN) — or a set of real-valued, correlated, random variables that are each clustered around a mean — is desirable, but not required.
Variables in the dataset should be continuous.
Although PCA is like factor analysis, they have two major differences: One difference is that PCA does not regress to find some underlying cause of shared variance, but instead decomposes a dataset to succinctly represent its most important information in a reduced number of features. The other key difference is that, with PCA, the first time you run the model, you don’t specify the number of components to be discovered in the dataset. You let the initial model results tell you how many components to keep, and then you rerun the analysis to extract those features.
Similar to the CVE discussion in the SVD part of this chapter, the amount of variance you retain depends on how you’re applying PCA, as well as the data you’re inputting into the model. Breaking it down based on how you’re applying PCA, the following rules of thumb become relevant:
Used for descriptive analytics: If PCA is being used for descriptive purposes only (for example, when working to build a descriptive avatar of your company’s ideal customer) the CVE can be lower than 95 percent. In this case you can get away with a CVE as low as 75-80 percent.
Used for diagnostic, predictive or prescriptive analytics: If principal components are meant for downstream models that generate diagnostic, predictive or prescriptive analytics, then CVE should be 95 percent or higher. Just realize that the lower the CVE, the less reliable your model results will be downstream. Each percentage of CVE that’s lost represents a small amount of information from your original dataset that won’t be captured by the principal components.
When using PCA for outlier detection, simply plot the principal components on an x-y scatter plot and visually inspect for areas that might have outliers. Those data points correspond to potential outliers that are worth investigating.