Multiblock Data Fusion in Statistics and Machine Learning - Tormod Næs
1.3.3 Dimension Reduction
There are many ways to perform multiblock data analysis. We restrict the focus of this book to approaches that use dimension reduction methods to tackle multiblock problems. The basic idea of dimension reduction methods is to extract components or latent variables from data blocks, see Figure 1.2.
Figure 1.2 Idea of dimension reduction and components. The scores T summarise the relationships between samples; the loadings P summarise the relationships between variables. Sometimes weights W are used to define the scores.
In this figure, the matrix $X$ ($I \times J$) consists of $J$ variables measured on $I$ samples. The matrix $W$ ($J \times R$) of weights defines the scores $XW = T$ ($I \times R$), where $R$ is much smaller than $J$. This is the dimension reduction (or data compression) part; the idea is that $T$ represents the samples in $X$ well for the purpose at hand. Likewise, the variables are represented by the loadings $P$ ($J \times R$), which can be connected to the scores in a least squares sense, e.g., in the model $X = TP^{t} + E$, where $E$ collects the residuals. There are many alternative ways to compute the weights, scores, and loadings depending on the specific situation; these are explained in subsequent chapters.
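As one concrete illustration of the weights, scores, and loadings described above, the following minimal sketch uses principal component analysis computed from a singular value decomposition. This is only one of the many alternatives mentioned in the text. The column-centring step, the simulated data, and the function name `extract_components` are assumptions made for this example and do not come from the book.

```python
import numpy as np

def extract_components(X, R):
    """PCA-style sketch: return weights W, scores T, loadings P and residuals E.

    Assumes X is column-centred before components are extracted.
    """
    Xc = X - X.mean(axis=0)                       # centre each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:R].T                                  # weights, J x R
    T = Xc @ W                                    # scores,  I x R
    # Loadings fitted to the scores in a least squares sense: Xc ~ T P^t
    P = np.linalg.lstsq(T, Xc, rcond=None)[0].T   # loadings, J x R
    E = Xc - T @ P.T                              # residuals, I x J
    return Xc, W, T, P, E

# Simulated block (hypothetical data): I samples, J variables, R underlying sources
rng = np.random.default_rng(0)
I, J, R = 50, 10, 2
X = rng.normal(size=(I, R)) @ rng.normal(size=(R, J)) + 0.01 * rng.normal(size=(I, J))

Xc, W, T, P, E = extract_components(X, R)
relative_residual = np.linalg.norm(E) / np.linalg.norm(Xc)
```

For this particular choice of weights the least squares loadings coincide with the weights themselves; other methods discussed later in the book generally give different weights and loadings.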
The idea of dimension reduction by using components or latent variables is very old and has proven to be a very powerful paradigm, with many applications in the natural, life, and social sciences. When considering multiple blocks of data, each block is summarised by its components, and the relationships between the blocks are then modelled by building relationships between those components. There are also many ways to build such relationships, and we will discuss those in this book.
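To make the two-step idea concrete, here is a rough sketch under strong simplifying assumptions: two simulated blocks share the same samples and one common latent variable, each block is summarised by a single principal component, and the relationship between the blocks is modelled by a least squares regression between the two sets of scores. The data, the helper `block_scores`, and the choice of regression as the linking step are illustrative assumptions, not a method prescribed by the book.

```python
import numpy as np

def block_scores(X, R):
    """Summarise one block by the scores of its first R principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:R].T

# Two hypothetical blocks measured on the same I samples,
# both driven by one common latent variable z plus noise
rng = np.random.default_rng(1)
I = 100
z = rng.normal(size=(I, 1))
X1 = z @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(I, 8))
X2 = z @ rng.normal(size=(1, 6)) + 0.1 * rng.normal(size=(I, 6))

# Step 1: summarise each block by its components
T1 = block_scores(X1, R=1)
T2 = block_scores(X2, R=1)

# Step 2: relate the blocks through their scores
# (scores are centred, so no intercept is needed)
B, *_ = np.linalg.lstsq(T1, T2, rcond=None)
r = np.corrcoef(T1[:, 0], T2[:, 0])[0, 1]
```

The sign of `r` is arbitrary because the sign of a principal component is arbitrary; only its magnitude indicates how strongly the two blocks are related through these components.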
There are many reasons for and advantages of using dimension reduction methods:
The number of sources of variability in data blocks is usually (much) smaller than the number of measured variables.
Component-based methods are suitable for interpretation through the scores and loadings associated with the extracted components.
Underlying components and latent variables are appropriate for mental abstractions and interpretation.
Multivariate data analysis becomes numerically stable and statistically robust if the components are chosen in a suitable way.
Empirical validation of the models becomes manageable.
The effect of measurement noise is reduced.
Outliers can often be detected by visual inspection of the associated subspace projections provided by the extracted components.
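Two of the points in this list, noise reduction and outlier detection, can be illustrated with a small simulation. The sketch below assumes PCA as the dimension reduction method and uses hypothetical simulated data; the noise level, the outlier placed at row 7, and the standardised score distance used in place of visual inspection are all choices made for this example only.

```python
import numpy as np

def pca_fit(X, R):
    """Centre X and return (centred X, scores T, loadings P) for R components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:R].T
    return Xc, Xc @ P, P

# Hypothetical data: R sources of variability, J measured variables, additive noise
rng = np.random.default_rng(2)
I, J, R = 60, 20, 2
A = rng.normal(size=(R, J))
signal = rng.normal(size=(I, R)) @ A
X = signal + 0.5 * rng.normal(size=(I, J))

# Noise reduction: compare raw data and rank-R reconstruction with the noise-free signal
Xc, T, P = pca_fit(X, R)
X_hat = T @ P.T
signal_c = signal - signal.mean(axis=0)
err_raw = np.linalg.norm(Xc - signal_c)
err_hat = np.linalg.norm(X_hat - signal_c)

# Outlier detection: push one sample far out along the first source and
# look for the sample with the largest standardised distance in score space
X_out = X.copy()
X_out[7] += 10.0 * A[0]
_, T_out, _ = pca_fit(X_out, R)
d = np.sqrt(((T_out / T_out.std(axis=0)) ** 2).sum(axis=1))
suspect = int(np.argmax(d))
```

In practice the scores in `T_out` would typically be inspected in a scatter plot; the distance `d` is simply a numerical stand-in for that visual check.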