Data Science For Dummies - Lillian Pierson
Reducing dimensionality with factor analysis
Factor analysis is along the same lines as SVD in that it’s a method you can use for filtering out redundant information and noise from your data. An offspring of the psychometrics field, this method was developed to help you derive a root cause in cases where a shared root cause results in shared variance, that is, when a variable’s variance correlates with the variance of other variables in the dataset.
A variable’s variance measures how far its values spread around its mean. The greater a variable’s variance, the more information that variable contains.
When you find shared variance in your dataset, that means information redundancy is at play. You can use factor analysis or principal component analysis to clear your data of this information redundancy. You see more on principal component analysis in the following section, but for now, focus on factor analysis and the fact that you can use it to compress your dataset’s information into a reduced set of meaningful, non-information-redundant latent variables — meaningful inferred variables that underlie a dataset but are not directly observable.
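To see what shared variance looks like in practice, here is a minimal sketch in Python with NumPy. The data is made up for illustration: a hidden variable (called latent here) drives two observed features, and a third feature is unrelated. The variable names and noise levels are assumptions, not something taken from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one hidden root cause drives two observed features.
latent = rng.normal(size=500)
x1 = latent + 0.3 * rng.normal(size=500)
x2 = latent + 0.3 * rng.normal(size=500)
x3 = rng.normal(size=500)  # a feature with no shared root cause

X = np.column_stack([x1, x2, x3])

# Variance: how far each feature's values spread around its mean.
variances = X.var(axis=0, ddof=1)

# Pairwise correlations between features (columns).
corr = np.corrcoef(X, rowvar=False)

# Squared correlation: the proportion of variance two features share.
shared = corr ** 2

print(np.round(variances, 2))
print(np.round(shared, 2))
```

In this made-up example, x1 and x2 share most of their variance because both come from the same hidden variable, while x3 shares almost none with either. That overlap between x1 and x2 is the information redundancy that factor analysis tries to compress.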
Factor analysis makes the following assumptions:
Your features are metric — numeric variables on which meaningful calculations can be made.
Your features should be continuous or ordinal (if you’re not sure what ordinal is, refer back to the first class, business class, and economy class analogy in the probability distributions section of this chapter).
You have more than 100 observations in your dataset and at least 5 observations per feature.
Your sample is homogeneous.
The features in your dataset are correlated with one another at r > 0.3.
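Some of the assumptions above can be checked with a few lines of code. The following is a rough sketch, not a standard routine: the function name check_factor_analysis_assumptions and its parameters are hypothetical. It also reads the correlation assumption as "every feature correlates above 0.3, in absolute value, with at least one other feature," which is only one possible interpretation. It does not test whether your features are metric or whether your sample is homogeneous; those calls require knowledge of how the data was collected.

```python
import numpy as np

def check_factor_analysis_assumptions(X, min_obs=100, min_obs_per_feature=5, min_r=0.3):
    """Rough checks for sample size and feature correlation (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_features = X.shape

    # Absolute pairwise correlations, ignoring each feature's correlation with itself.
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(abs_corr, 0.0)
    max_r_per_feature = abs_corr.max(axis=1)

    return {
        "more_than_min_observations": n_obs > min_obs,
        "enough_observations_per_feature": n_obs / n_features >= min_obs_per_feature,
        # Assumed reading: each feature has at least one |r| above the threshold.
        "every_feature_correlated": bool((max_r_per_feature > min_r).all()),
    }

# Made-up data: 200 observations of 4 features that share one hidden driver.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(200, 1))
X = hidden + 0.5 * rng.normal(size=(200, 4))

report = check_factor_analysis_assumptions(X)
small_report = check_factor_analysis_assumptions(X[:50])
print(report)
print(small_report)
```

The second call uses only the first 50 rows, so it fails the sample-size check even though there are still more than 5 observations per feature.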
In factor analysis, you do a regression — a topic covered later in this chapter — on features to uncover underlying latent variables, or factors. You can then use those factors as variables in future analyses, to represent the original dataset from which they’re derived. At its core, factor analysis is the process of fitting a model to prepare a dataset for analysis by reducing its dimensionality and information redundancy.
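As an illustration of fitting such a model, here is a minimal sketch that uses the FactorAnalysis class from scikit-learn. Choosing scikit-learn is an assumption made for this example, and the dataset is synthetic: 6 observed features generated from 2 hidden factors plus noise. With real data you would not know the number of factors in advance, so the value passed to n_components is a choice you have to make and justify.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical dataset: 300 observations, 6 features driven by 2 hidden factors.
n_obs, n_features, n_factors = 300, 6, 2
true_factors = rng.normal(size=(n_obs, n_factors))
true_loadings = rng.normal(size=(n_factors, n_features))
X = true_factors @ true_loadings + 0.5 * rng.normal(size=(n_obs, n_features))

# Put features on a common scale so no single feature dominates the fit.
X_scaled = StandardScaler().fit_transform(X)

# Fit the factor model and compute a score for each observation on each factor.
fa = FactorAnalysis(n_components=n_factors, random_state=0)
factor_scores = fa.fit_transform(X_scaled)

loadings = fa.components_   # how strongly each factor relates to each feature
noise = fa.noise_variance_  # per-feature variance the factors do not explain

print(factor_scores.shape)  # one row per observation, one column per factor
print(np.round(loadings, 2))
print(np.round(noise, 2))
```

The factor_scores array is the reduced dataset: it has the same number of rows as the original but only as many columns as factors, and you can pass it to later analyses in place of the original features.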