Multiblock Data Fusion in Statistics and Machine Learning
Table of Contents
Age K. Smilde, Tormod Næs and Kristian Hovde Liland
Multiblock Data Fusion in Statistics and Machine Learning: Applications in the Natural and Life Sciences
Contents
List of Figures
List of Tables
Foreword
Preface
1 Introduction
1.1 Scope of the Book
ELABORATION 1.1. Glossary of terms
1.2 Potential Audience
1.3 Types of Data and Analyses. 1.3.1 Supervised and Unsupervised Analyses
1.3.2 High-, Mid- and Low-level Fusion
ELABORATION 1.2. High-level Supervised Fusion
1.3.3 Dimension Reduction
1.3.4 Indirect Versus Direct Data
1.3.5 Heterogeneous Fusion
1.4 Examples
1.4.1 Metabolomics
ELABORATION 1.3. Terms in metabolomics and proteomics
Example 1.1: Metabolomics example: plant science data
1.4.2 Genomics
ELABORATION 1.4. Terms in genomics
Example 1.2: Genetics example
1.4.3 Systems Biology
ELABORATION 1.5. Terms in systems biology
1.4.4 Chemistry
ELABORATION 1.6. Terms in chemistry
Example 1.3: Chemistry example: Raman spectroscopy data
1.4.5 Sensory Science
ELABORATION 1.7. Terms in sensory analysis
Example 1.4: Sensory example: consumer liking
1.5 Goals of Analyses
1.6 Some History
1.7 Fundamental Choices
1.8 Common and Distinct Components
ELABORATION 1.8. Common and distinct in spectroscopy
1.9 Overview and Links
1.10 Notation and Terminology
1.11 Abbreviations
Notes
2 Basic Theory and Concepts
2.i General Introduction
2.1 Component Models
2.1.1 General Idea of Component Models
2.1.2 Principal Component Analysis
ELABORATION 2.1. Geometry of PCA
Example 2.1: PCA on wines
ELABORATION 2.2. The difference between PCA and factor analysis
2.1.3 Sparse PCA
ELABORATION 2.3. Sparse PCA
2.1.4 Principal Component Regression
2.1.5 Partial Least Squares
ELABORATION 2.4. Constraint on weights w gives the scores t as a least squares (LS) solution
Algorithm 2.1. NIPALS
Example 2.2: Multivariate calibration using PLS
2.1.6 Sparse PLS
Algorithm 2.2. Sparse NIPALS
2.1.7 Principal Covariates Regression
2.1.8 Redundancy Analysis
2.1.9 Comparing PLS, PCovR and RDA
2.1.10 Generalised Canonical Correlation Analysis
2.1.11 Simultaneous Component Analysis
2.2 Properties of Data
2.2.1 Data Theory
ELABORATION 2.5. Normalisation of urine NMR metabolomics data
ELABORATION 2.6. Genomics and proteomics analysis of glucose starvation
2.2.2 Scale-types
2.3 Estimation Methods
2.3.1 Least-squares Estimation
Algorithm 2.3. Alternating Least Squares
2.3.2 Maximum-likelihood Estimation
ELABORATION 2.7. PCA for binary data
2.3.3 Eigenvalue Decomposition-based Methods
2.3.4 Covariance or Correlation-based Estimation Methods
2.3.5 Sequential Versus Simultaneous Methods
ELABORATION 2.8. Deflation in PLS
2.3.6 Homogeneous Versus Heterogeneous Fusion
ELABORATION 2.9. Optimal-scaling
2.4 Within- and Between-block Variation
2.4.1 Definition and Example
2.4.2 MAXBET Solution
2.4.3 MAXNEAR Solution
2.4.4 PLS2 Solution
2.4.5 CCA Solution
2.4.6 Comparing the Solutions
2.4.7 PLS, RDA and CCA Revisited
Algorithm 2.4. Iterative algorithms for PLS, RDA and CCA
2.5 Framework for Common and Distinct Components
2.6 Preprocessing
2.7 Validation
2.7.1 Outliers
2.7.1.1 Residuals
2.7.1.2 Leverage
2.7.2 Model Fit
2.7.3 Bias-variance Trade-off
2.7.4 Test Set Validation
ELABORATION 2.10. Sample size
2.7.5 Cross-validation
ELABORATION 2.11. Cross-validation of unsupervised methods
2.7.6 Permutation Testing
2.7.7 Jackknife and Bootstrap
2.7.8 Hyper-parameters and Penalties
2.8 Appendix
COLUMN- AND ROW-SPACES
DIRECT SUM OF SPACES
POSITIVE DEFINITE MATRICES
SINGULAR VALUE DECOMPOSITION AND EIGEN DECOMPOSITION
TRACE AND VEC
NORMS OF VECTORS AND MATRICES
DEFLATION AND ORTHOGONALISATION
EXPLAINED SUM-OF-SQUARES
MULTICOLLINEARITY
MOORE–PENROSE INVERSE
REGRESSION COEFFICIENTS IN NIPALS BASED ALGORITHMS
EIGENVALUE EQUATIONS FOR PLS, RDA AND CCA
Notes
3 Structure of Multiblock Data
3.i General Introduction
3.1 Taxonomy
3.2 Skeleton of a Multiblock Data Set
3.2.1 Shared Sample Mode
3.2.2 Shared Variable Mode
3.2.3 Shared Variable or Sample Mode
3.2.4 Shared Variable and Sample Mode
3.3 Topology of a Multiblock Data Set
3.3.1 Unsupervised Analysis
SHARED SAMPLE MODE
SHARED VARIABLE MODE
SHARED SAMPLE OR VARIABLE MODE
SHARED VARIABLE AND SAMPLE MODE
3.3.2 Supervised Analysis
3.4 Linking Structures
3.4.1 Linking Structure for Unsupervised Analysis
3.4.2 Linking Structures for Supervised Analysis
3.5 Summary
Notes
4 Matrix Correlations
4.i General Introduction
4.1 Definition
ELABORATION 4.1. An interpretation of the SVD
4.2 Most Used Matrix Correlations
4.2.1 Inner Product Correlation
4.2.2 GCD-coefficient
4.2.3 RV-coefficient
4.2.4 SMI-coefficient
4.3 Generic Framework of Matrix Correlations
ELABORATION 4.2. SMI-OP for the case of R1 ≠ R2
4.4 Generalised Matrix Correlations
4.4.1 Generalised RV-coefficient
4.4.2 Generalised Association Coefficient
Example 4.1: GAC and common/distinct components
4.5 Partial Matrix Correlations
Example 4.2: iTOP: Inferring topology of genomics data
4.6 Conclusions and Recommendations
4.7 Open Issues
PROPERTIES AND PRACTICAL USE
FRAMEWORK FOR HETEROGENEOUS DATA
RELATIONSHIPS BETWEEN MATRIX CORRELATION AND COMMON/DISTINCT COMPONENTS
5 Unsupervised Methods
5.i General Introduction
5.ii Relations to the General Framework
5.1 Shared Variable Mode
5.1.1 Only Common Variation
5.1.1.1 Simultaneous Component Analysis
THE SCA MODEL
ELABORATION 5.1. Different constraints in SCA models
PREPROCESSING
ELABORATION 5.2. Within-block scaling in SCA models
VALIDATION
ELABORATION 5.3. Explained variances per block in SCA
5.1.1.2 Clustering and SCA
Fuzzy SCA clustering
Algorithm 5.1. Fuzzy SCA clustering
Clusterwise SCA
5.1.1.3 Multigroup Data Analysis
Example 5.1: Example of multigroup analysis
5.1.2 Common, Local, and Distinct Variation
5.1.2.1 Distinct and Common Components
Example 5.2: Example of DISCO in genomics
5.1.2.2 Multivariate Curve Resolution
ELABORATION 5.4. Stay in the row-space or not?
5.2 Shared Sample Mode
5.2.1 Only Common Variation
5.2.1.1 SUM-PCA
ELABORATION 5.5. Block-scores
ELABORATION 5.6. Sparse SCA
5.2.1.2 Multiple Factor Analysis and STATIS
5.2.1.3 Generalised Canonical Analysis
ELABORATION 5.7. GCA as an eigenproblem
ELABORATION 5.8. GCA Correlation loadings
5.2.1.4 Regularised Generalised Canonical Correlation Analysis
ELABORATION 5.9. A simple example of RGCCA
5.2.1.5 Exponential Family SCA
ELABORATION 5.10. The logistic function
Example 5.3: GSCA example
5.2.1.6 Optimal-scaling
Basic idea
ELABORATION 5.11. Non-linear PCA
Multiblock optimal-scaling
5.2.2 Common, Local, and Distinct Variation
5.2.2.1 Joint and Individual Variation Explained
5.2.2.2 Distinct and Common Components
5.2.2.3 PCA-GCA
Example 5.4: Example of DISCO and PCA-GCA on sensory data
Algorithm 5.2. PCA-GCA for three blocks of data
Example 5.5: Example of DISCO and PCA-GCA in medical biology
5.2.2.4 Advanced Coupled Matrix and Tensor Factorisation
Example 5.6: Example of JIVE and ACMTF
5.2.2.5 Penalised-ESCA
ELABORATION 5.12. Group-wise penalties
5.2.2.6 Multivariate Curve Resolution
5.3 Generic Framework
5.3.1 Framework for Simultaneous Unsupervised Methods
5.3.1.1 Description of the Framework
ELABORATION 5.13. Association rules
5.3.1.2 Framework Applied to Simultaneous Unsupervised Data Analysis Methods
5.3.1.3 Framework of Common/Distinct Applied to Simultaneous Unsupervised Multiblock Data Analysis Methods
5.4 Conclusions and Recommendations
Properties of some of the methods
Which method to use?
5.5 Open Issues
META-PARAMETER OR HYPER-PARAMETER SELECTION
VARIABLE SELECTION
NON-LINEARITIES
MISSING DATA
OUTLIERS AND PERFORMANCE OF THE METHODS
Notes
6 ASCA and Extensions
6.i General Introduction
6.ii Relations to the General Framework
6.1 ANOVA-Simultaneous Component Analysis
6.1.1 The ASCA Method
ELABORATION 6.1. ASCA: set-up of the matrices involved
Example 6.1: Plant Metabolomics
Example 6.2: Toxicology example
6.1.2 Validation of ASCA
6.1.2.1 Permutation Testing
Example 6.3: Simple permutation test
Example 6.4: Plant metabolomics validation
6.1.2.2 Back-projection
6.1.2.3 Confidence Ellipsoids
Example 6.5: ASCA: Sensory assessment of candies
6.1.3 The ASCA+ and LiMM-PCA Methods
6.2 Multilevel-SCA
6.3 Penalised-ASCA
Example 6.6: PE-ASCA: NMR metabolomics of pig brains
6.4 Conclusions and Recommendations
6.5 Open Issues
Notes
7 Supervised Methods
7.i General Introduction
7.ii Relations to the General Framework
7.1 Multiblock Regression: General Perspectives
7.1.1 Model and Assumptions
7.1.2 Different Challenges and Aims
7.2 Multiblock PLS Regression
7.2.1 Standard Multiblock PLS Regression
Algorithm 7.1. MB-PLS
Example 7.1: MB-PLS: Raman on PUFA containing emulsions
MB-PLS versus PLS2
7.2.2 MB-PLS Used for Classification
Example 7.2: MB-PLS for classification: Metabolomics in colorectal cancer
7.2.3 Sparse Multiblock PLS Regression (sMB-PLS)
Algorithm 7.2. SPARSE MB-PLS
Example 7.3: Sparse MB-PLS in metabolomics
7.3 The Family of SO-PLS Regression Methods (Sequential and Orthogonalised PLS Regression)
7.3.1 The SO-PLS Method
Algorithm 7.3. SO-PLS
7.3.2 Order of Blocks
7.3.3 Interpretation Tools
7.3.4 Restricted PLS Components and their Application in SO-PLS
Algorithm 7.4. Restricted PLS Components and their Use in SO-PLS
7.3.5 Validation and Component Selection
7.3.6 Relations to ANOVA
Example 7.4: SO-PLS: Sensory assessment of wines
Example 7.5: SO-PLS: Raman on PUFA containing emulsions
SO-PLS versus MB-PLS
7.3.7 Extensions of SO-PLS to Handle Interactions Between Blocks
Example 7.6: Interactions through linear combinations
Example 7.7: SO-PLS: Incorporating interactions
7.3.8 Further Applications of SO-PLS
7.3.9 Relations Between SO-PLS and ASCA
Example 7.8: SO-PLS: Sensory assessment of candies
7.4 Parallel and Orthogonalised PLS (PO-PLS) Regression
Algorithm 7.5. PO-PLS
Example 7.9: PO-PLS: Raman on PUFA containing emulsions
PO-PLS versus MB-PLS and SO-PLS
7.5 Response Oriented Sequential Alternation
7.5.1 The ROSA Method
Algorithm 7.6. ROSA
7.5.2 Validation
7.5.3 Interpretation
Example 7.10: ROSA: Raman on PUFA containing emulsions
ROSA versus MB-PLS, SO-PLS and PO-PLS
7.6 Conclusions and Recommendations
INVARIANCE TO BETWEEN-BLOCK SCALING
CHOOSING THE NUMBER OF COMPONENTS
NUMBER OF UNDERLYING DIMENSIONS
COMMON VERSUS DISTINCT COMPONENTS
MODIFICATIONS AND EXTENSIONS OF ORIGINAL VERSIONS
7.7 Open Issues
8 Complex Block Structures: With Focus on L-Shape Relations
8.i General Introduction
8.ii Relations to the General Framework
8.1 Analysis of L-shape Data: General Perspectives
8.2 Sequential Procedures for L-shape Data Based on PLS/PCR and ANOVA
8.2.1 Interpretation of X1, Quantitative X2-data, Horizontal Axis First
ELABORATION 8.1. Missing data and validation
8.2.2 Interpretation of X1, Categorical X2-data, Horizontal Axis First
ELABORATION 8.2. Hybrid approaches of methods in Sections 8.2.1 and 8.2.2
8.2.3 Analysis of Segments/Clusters of X1 Data
Example 8.1: Preference mapping and segmentation of consumers
Example 8.2: Conjoint analysis, X2-matrix based on categorical variables
ELABORATION 8.3. Possible extensions
8.3 The L-PLS Method for Joint Estimation of Blocks in L-shape Data
8.3.1 The Original L-PLS Method, Endo-L-PLS
Algorithm 8.7. Endo-L-PLS algorithm with focus on enhanced interpretation
Example 8.3: Apple data analysed by Endo-L-PLS
8.3.2 Exo- Versus Endo-L-PLS
Algorithm 8.8. Exo-L-PLS
8.4 Modifications of the Original L-PLS Idea
8.4.1 Weighting Information from X3 and X1 in L-PLS Using a Parameter α
Example 8.4: Genomics and breast cancer classification
8.4.2 Three-blocks Bifocal PLS
8.5 Alternative L-shape Data Analysis Methods
8.5.1 Principal Component Analysis with External Information
8.5.2 A Simple PCA Based Procedure for Using Unlabelled Data in Calibration
8.5.3 Multivariate Curve Resolution for Incomplete Data
8.5.4 An Alternative Approach in Consumer Science Based on Correlations Between X3 and X1
8.6 Domino PLS and More Complex Data Structures
8.7 Conclusions and Recommendations
8.8 Open Issues
9 Alternative Unsupervised Methods
9.i General Introduction
9.ii Relations to the General Framework
9.1 Shared Variable Mode
9.2 Shared Sample Mode
9.2.1 Only Common Variation
9.2.1.1 DIABLO
ELABORATION 9.1. Example of DIABLO
9.2.1.2 Generalised Coupled Tensor Factorisation
ELABORATION 9.2. Divergence measures
9.2.1.3 Representation Matrices
REPRESENTATION MATRICES FOR RATIO-, INTERVAL-, AND ORDINAL-SCALED VARIABLES
Example 9.1: Representation matrices
REPRESENTATION MATRICES FOR NOMINAL-SCALED VARIABLES
USING REPRESENTATION MATRICES IN HETEROGENEOUS DATA FUSION
ELABORATION 9.3. INDSCAL and INDORT
Example 9.2: Analysing heterogeneous genomics data
9.2.1.4 Extended PCA
9.2.2 Common, Local, and Distinct Variation
9.2.2.1 Generalised SVD
9.2.2.2 Structural Learning and Integrative Decomposition
9.2.2.3 Bayesian Inter-battery Factor Analysis
Example 9.3: Example of BIBFA and ACMTF
9.2.2.4 Group Factor Analysis
9.2.2.5 OnPLS
9.2.2.6 Generalised Association Study
9.2.2.7 Multi-Omics Factor Analysis
Example 9.4: PESCA versus MOFA
9.3 Two Shared Modes and Only Common Variation
9.3.1 Generalised Procrustes Analysis
9.3.2 Three-way Methods
9.4 Conclusions and Recommendations
9.4.1 Open Issues
PRIORS AND PENALTIES
PROPERTIES OF THE ESTIMATED PARAMETERS
Notes
10 Alternative Supervised Methods
10.i General Introduction
10.ii Relations to the General Framework
10.1 Model and Focus
10.2 Extension of PCovR
10.2.1 Sparse Multiblock Principal Covariates Regression, Sparse PCovR
10.2.2 Multiway Multiblock Covariates Regression
Example 10.1: Multiway multiblock covariates regression model of batch process data
10.3 Multiblock Redundancy Analysis
10.3.1 Standard Multiblock Redundancy Analysis
Example 10.2: Multiblock redundancy analysis: sensory assessment of wines
10.3.2 Sparse Multiblock Redundancy Analysis
Algorithm 10.1. SPARSE MULTIBLOCK REDUNDANCY ANALYSIS
10.4 Miscellaneous Multiblock Regression Methods
10.4.1 Multiblock Variance Partitioning
Algorithm 10.4. Summary of multiblock variance partitioning for three input blocks
10.4.2 Network Induced Supervised Learning
10.4.3 Common Dimensions for Multiblock Regression
10.5 Modifications and Extensions of the SO-PLS Method
10.5.1 Extensions of SO-PLS to Three-Way Data
10.5.2 Variable Selection for SO-PLS
10.5.3 More Complicated Error Structure for SO-PLS
ELABORATION 10.5. SO-PLS with more complicated error structure
10.5.4 SO-PLS Used for Path Modelling
ELABORATION 10.2. Definition of topological order of a DAG
Example 10.2: Wine example
ELABORATION 10.3. SO-PLS related methods
10.6 Methods for Data Sets Split Along the Sample Mode, Multigroup Methods
10.6.1 Multigroup PLS Regression
10.6.2 Clustering of Observations in Multiblock Regression
10.6.3 Domain-Invariant PLS, DI-PLS
10.7 Conclusions and Recommendations
COMBINING GROUPS OF SAMPLES IN ONE SINGLE MODEL
COLLINEAR X-DATA
COMBINING TWO-WAY AND THREE-WAY INPUT BLOCKS
DISTINCT AND COMMON VARIABILITY
10.8 Open Issues
11 Algorithms and Software
11.1 Multiblock Software
Chapter structure
11.2 R package multiblock
11.3 Installing and Starting the Package
11.4 Data Handling
11.4.1 Read From File
11.4.2 Data Pre-processing
11.4.3 Re-coding Categorical Data
11.4.4 Data Structures for Multiblock Analysis
11.4.4.1 Create List of Blocks
11.4.4.2 Create data.frame of Blocks
11.5 Basic Methods
11.5.1 Prepare Data
11.5.2 Modelling
11.5.3 Common Output Elements Across Methods
11.5.4 Scores and Loadings
11.6 Unsupervised Methods
11.6.1 Formatting Data for Unsupervised Data Analysis
11.6.2 Method Interfaces
11.6.3 Shared Sample Mode Analyses
11.6.4 Shared Variable Mode
11.6.5 Common Output Elements Across Methods
11.6.6 Scores and Loadings
11.6.7 Plot From Imported Package
11.7 ANOVA Simultaneous Component Analysis
11.7.1 Formula Interface
11.7.2 Simulated Data
11.7.3 ASCA Modelling
11.7.4 ASCA Scores
11.7.5 ASCA Loadings
11.8 Supervised Methods
11.8.1 Formatting Data for Supervised Analyses
11.8.2 Multiblock Partial Least Squares
11.8.2.1 MB-PLS Modelling
11.8.2.2 MB-PLS Summaries and Plotting
11.8.3 Sparse Multiblock Partial Least Squares
11.8.3.1 Sparse MB-PLS Modelling
11.8.3.2 Sparse MB-PLS Plotting
11.8.4 Sequential and Orthogonalised Partial Least Squares
11.8.4.1 SO-PLS Modelling
11.8.4.2 Måge Plot
11.8.4.3 SO-PLS Loadings
11.8.4.4 SO-PLS Scores
11.8.4.5 SO-PLS Prediction
11.8.4.6 SO-PLS Validation
11.8.4.7 Principal Components of Predictions
11.8.4.8 CVANOVA
11.8.5 Parallel and Orthogonalised Partial Least Squares
11.8.5.1 PO-PLS Modelling
11.8.5.2 PO-PLS Scores and Loadings
11.8.6 Response Oriented Sequential Alternation
11.8.6.1 ROSA Modelling
11.8.6.2 ROSA Loadings
11.8.6.3 ROSA Scores
11.8.6.4 ROSA Prediction
11.8.6.5 ROSA Validation
11.8.6.6 ROSA Image Plots
11.8.7 Multiblock Redundancy Analysis
11.8.7.1 MB-RDA Modelling
11.8.7.2 MB-RDA Loadings and Scores
11.9 Complex Data Structures
11.9.1 L-PLS
11.9.1.1 Simulated L-shaped Data
11.9.1.2 Exo-L-PLS
11.9.1.3 Endo-L-PLS
11.9.1.4 L-PLS Cross-validation
11.9.2 SO-PLS-PM
11.9.2.1 Single SO-PLS-PM Model
11.9.2.2 Multiple Paths in an SO-PLS-PM Model
11.10 Software Packages
11.10.1 R Packages
11.10.2 MATLAB Toolboxes
11.10.3 Python
11.10.4 Commercial Software
References
Index
WILEY END USER LICENSE AGREEMENT
Excerpt from the book
Age K. Smilde
Swammerdam Institute for Life Sciences, University of Amsterdam,
.....
Figure 6.14 PE-ASCA of the NMR metabolomics of pig brains. Stars in the score plots are the factor estimates and circles are the back-projected individual measurements (Zwanenburg et al., 2011). Source: Alinaghi et al. (2020). Licensed under CC BY 4.0.
Figure 6.15 Tree for selecting an ASCA-based method. For abbreviations, see the legend of Table 6.1; BAL = Balanced data, UNB = Unbalanced data. For more explanation, see text.
.....