
List of Figures


Figure 1.1 High-level, mid-level, and low-level fusion for two input blocks. The Z’s represent the combined information from the two blocks which is used for making the predictions. The upper figure represents high-level fusion, where the results from two separate analyses are combined. The figure in the middle is an illustration of mid-level fusion, where components from the two data blocks are combined before further analysis. The lower figure illustrates low-level fusion, where the data blocks are simply combined into one data block before further analysis takes place.

Figure 1.2 Idea of dimension reduction and components. The scores T summarise the relationships between samples; the loadings P summarise the relationships between variables. Sometimes weights W are used to define the scores.

Figure 1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.

Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.

Figure 1.5 Idea of copy number variation (a), methylation (b), and mutation (c) of the DNA. For (a) and (c): Source: Adapted from Koch et al., 2012.

Figure 1.6 Plot of the Raman spectra used in predicting the fat content. The dashed lines show the split of the data set into multiple blocks.

Figure 1.7 L-shape data of consumer liking studies.

Figure 1.8 Phylogeny of some multiblock methods and relations to basic data analysis methods used in this book.

Figure 1.9 The idea of common and distinct components. Legend: blue is common variation; dark yellow and dark red are distinct variation and shaded areas are noise (unsystematic variation).

Figure 2.1 Idea of dimension reduction and components. Sometimes W is used to define the scores T which in turn define the loadings P.

Figure 2.2 Geometry of PCA. For explanation, see text (with permission of H.J. Ramaker, TIPb, The Netherlands).

Figure 2.3 Score (a) and loading (b) plots of a PCA on Cabernet Sauvignon wines. Source: Bro and Smilde (2014). Reproduced with permission of Royal Society of Chemistry.

Figure 2.4 PLS validated explained variance when applied to Raman with PUFA responses. Left: PLSR on one response at a time. Right: PLS on both responses (standardised).

Figure 2.5 Score and loading plots for the single response PLS regression model predicting PUFA as percentage of total fat in the sample (PUFAsample).

Figure 2.6 Raw and normalised urine NMR-spectra. Different colours are spectra of different subjects.

Figure 2.7 Numerical representations of the lengths of sticks: (a) left: the empirical relational system (ERS) of which only the length is studied, right: a numerical representation (NRS1), (b) an alternative numerical representation (NRS2) of the same ERS carrying essentially the same information.

Figure 2.8 Classical (a) and logistic PCA (b) on the same mutation data of different cancers. Source: Song et al. (2017). Reproduced with permission from Oxford Academic Press.

Figure 2.9 Classical (a) and logistic PCA (b) on the same methylation data of different cancers. Source: Song et al. (2017). Reproduced with permission from Oxford Academic.

Figure 2.10 SCA for two data blocks; one containing binary data and one with ratio-scaled data.

Figure 2.11 The block scores of the rows of the two blocks. Legend: green squares are block scores of the first block; blue circles are block scores of the second block and the red stars are their averages (indicated with ta). Panel (a) favouring block X1, (b) the MAXBET solution, (c) the MAXNEAR solution.

Figure 2.12 Two column-spaces each of rank two in three-dimensional space. The blue and green surfaces represent the column-spaces and the red line indicated with X12C represents the common component. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 2.13 Common and distinct components. The common component is the same in both panels. For the distinct components there are now two choices regarding orthogonality: (a) both distinct components orthogonal to the common component, (b) distinct components mutually orthogonal. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 2.14 Common components in case of noise: (a) maximally correlated common components within column-spaces; (b) consensus component in neither of the column-spaces. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 2.15 Visualisation of a response vector, y, projected onto a two-dimensional data space spanned by x1 and x2.

Figure 2.16 Fitted values versus residuals from a linear regression model.

Figure 2.17 Simple linear regression: ŷ = ax + b (see legend for description of elements). In addition, leverage is indicated below the regression plot, where leverage is at a minimum at x̄ and increases for lower and higher x-values.

Figure 2.18 Two-variable multiple linear regression with indicated residuals and leverage (contours below regression plane).

Figure 2.19 Two-component PCA score plot of concatenated Raman data. Leverage for two components is indicated by the marker size.

Figure 2.20 Illustration of true versus predicted values from a regression model. The ideal line is indicated in dashed green.

Figure 2.21 Visualisation of the bias-variance trade-off as a function of model complexity. The observed MSE (in blue) is the sum of the bias² (red dashed), the variance (yellow dashed) and the irreducible error (purple dotted).

Figure 2.22 Learning curves showing how median R2 and Q2 from linear regression develop with the number of training samples for a simulated data set.

Figure 2.23 Visualisation of the process of splitting a data set into a set of segments (here chosen to be consecutive) and the sequential hold-out of one segment (Vk) for validation of models. All data blocks Xm and the response Y are split along the sample direction and corresponding segments removed simultaneously.

Figure 2.24 Cumulative explained variance for PCA of the concatenated Raman data using naive cross-validation (only leaving out samples). R2 is calibrated and Q2 is cross-validated.

Figure 2.25 Null distribution and observed test statistic used for significance estimation with permutation testing.

Figure 3.1 Skeleton of a three-block data set with a shared sample mode.

Figure 3.2 Skeleton of a four-block data set with a shared sample mode.

Figure 3.3 Skeleton of a three-block data set with a shared variable mode.

Figure 3.4 Skeleton of a three-block L-shaped data set with a shared variable or a shared sample mode.

Figure 3.5 Skeleton of a four-block U-shaped data set with a shared variable or a shared sample mode (a) and a four-block skeleton with a shared variable and a shared sample mode (b). This is a simplified version; it should be understood that all sample modes are shared as well as all variable modes.

Figure 3.6 Topology of a three-block data set with a shared sample mode and unsupervised analysis: (a) full topology and (b) simplified representation.

Figure 3.7 Topology of a three-block data set with a shared variable mode and unsupervised analysis.

Figure 3.8 Different arrangements of data sharing two modes. Topology (a) and multiway array (b).

Figure 3.9 Unsupervised combination of a three-way and two-way array.

Figure 3.10 Supervised three-set problem sharing the sample mode.

Figure 3.11 Supervised L-shape problem. Block X1 is a predictor for block X2 and extra information regarding the variables in block X1 is available in block X3.

Figure 3.12 Path model structure. Blocks are connected through shared samples and a causal structure is assumed.

Figure 3.13 Idea of linking two data blocks with a shared sample mode. For explanation, see text.

Figure 3.14 Different linking structures: (a) identity link, (b) flexible link, (c) partial identity link: common (T12C) and distinct (T1D, T2D) components.

Figure 3.15 Idea of linking two data blocks with shared variable mode.

Figure 3.16 Different linking structures for supervised analysis: (a) linking structure where components are used both for the X-blocks and the Y-block; (b) linking structure that only uses components for the X-blocks.

Figure 3.17 Treating common and distinct linking structures for supervised analysis: (a) linking structure with no differentiation between common and distinct in the X-blocks (C is common, D1, D2 are distinct for X1 and X2, respectively; eX1 and eX2 represent the unsystematic parts of X1 and X2); (b) first X1 is used and then the remainder of X2 after removing the common (predictive) part T1 of X1.

Figure 4.1 Explanation of the scale (a) and orientation (b) component of the SVD. The axes are two variables and the spread of the samples is visualised including their contours as ellipsoids. Hence, this is a representation of the row-spaces of the matrices. For more explanation, see text. Source: Smilde et al. (2015). Reproduced with permission of John Wiley and Sons.

Figure 4.2 Topology of interactions between genomics data sets. Source: Aben et al. (2018). Reproduced with permission of Oxford University Press.

Figure 4.3 The RV and partial RV coefficients for the genomics example. For explanation, see the main text. Source: Aben et al. (2018). Reproduced with permission of Oxford University Press.

Figure 4.4 Decision tree for selecting a matrix correlation method. Abbreviations: HOM is homogeneous data, HET is heterogeneous data, Gen-RV is generalised RV, Full means full correlations, Partial means partial correlations. For more explanation, see text.

Figure 5.1 Unsupervised analysis as discussed in this chapter, (a) links between samples and (b) links between variables (simplified representations, see Chapter 3).

Figure 5.2 Illustration explaining the idea of exploring multiblock data. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.3 The idea of common (C), local (L) and distinct (D) parts of three data blocks. The symbols Xt denote row spaces; Xt13L, e.g., is the part of Xt1 and Xt3 which is in common but does not share a part with Xt2.

Figure 5.4 Proportion of explained variances (variances accounted for) for the TIV block (upper part), the LAIV block (middle part) and the concatenated blocks (lower part). Source: Van Deun et al. (2013). Reproduced with permission of Elsevier.

Figure 5.5 Row-spaces visualised. The true row-space (blue) contains the pure spectra (blue arrows). The row-space of X is the green plane which contains the estimated spectra (green arrows). The red arrows are off the row-space and closer to the true pure spectra.

Figure 5.6 Difference between weights and correlation loadings explained. Green arrows are variables of Xm; red arrow is the consensus component t; blue arrow is the common component tm. Dotted lines represent projections.

Figure 5.7 The logistic function η(θ) = (1 + exp(−θ))⁻¹ visualised. Only the part between [−4, 4] is shown but the function goes from −∞ to +∞.

Figure 5.8 CNA data visualised. Legend: (a) each line is a sample (cell line), blanks are zeros and black dots are ones; (b) the proportion of ones per variable illustrating the unbalancedness. Source: Song et al. (2021). Reproduced with permission of Elsevier.

Figure 5.9 Score plot of the CNA data. Legend: (a) scores of a logistic PCA on CNA; (b) consensus scores of the first two GSCA components of a GSCA model (MITF is a special gene). Source: Smilde et al. (2020). Licensed under CC BY 4.0.

Figure 5.10 Plots for selecting numbers of components for the sensory example. (a) SCA: the curve represents cumulative explained variance for the concatenated data blocks. The bars show how much variance each component explains in the individual blocks. (b) DISCO: each point represents the non-congruence value for a given target (model). The plot includes all possible combinations of common and distinct components based on a total rank of three. The horizontal axis represents the number of common components and the numbers in the plot represent the number of distinct components for SMELL and TASTE, respectively. (c) PCA-GCA: black dots represent the canonical correlation coefficients between the PCA scores of the two blocks (×100) and the bars show how much variance the canonical components explain in each block. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.11 Biplots from PCA-GCA, showing the variables as vectors and the samples as points. The samples are labelled according to the design factors flavour type (A/B), sugar level (40, 60, 80) and flavour dose (2, 5, 8). The plots show the common component (horizontal) against the first distinct component for each of the two blocks. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.12 Amount of explained variation in the SCA model (a) and PCA models (b) of the medical biology metabolomics example. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.13 Amount of explained variation in the DISCO and PCA-GCA model. Legend: C-ALO is common across all blocks; C-AL is local between block A and L; D-A, D-O, D-L are distinct in the A, O and L blocks, respectively. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.14 Scores (upper part) and loadings (lower part) of the common DISCO component. Source: Smilde et al. (2017). Reproduced with permission of John Wiley and Sons.

Figure 5.15 ACMTF as applied on the combination of a three-way and a two-way data block. Legend: an ‘x’ means a non-zero value on the superdiagonal (three-way block) or the diagonal (two-way block). The three-way block is decomposed by a PARAFAC model. The red part of T is the common component, the blue part is distinct for X1, and the yellow part is distinct for X2 (see also the x and 0 values).

Figure 5.16 True design used in mixture preparation (blue) versus the columns of the associated factor matrix corresponding to the mixtures mode extracted by the JIVE model (red). Source: Acar et al. (2015). Reproduced with permission of IEEE.

Figure 5.17 True design used in mixture preparation (blue) versus the columns of the associated factor matrix corresponding to the mixtures mode extracted by the ACMTF model (red). Source: Acar et al. (2015). Reproduced with permission of IEEE.

Figure 5.18 Example of the properties of group-wise penalties. Left panel: the family of group-wise L-penalties. Right panel: the GDP penalties. The x-axis shows the L2 norm of the original group of elements to be penalised; the y-axis shows the value of this norm after applying the penalty. More explanation, see text. Source: Song et al. (2021). Reproduced with permission of John Wiley and Sons.

Figure 5.19 Quantification of modes and block-association rules. The matrix V ‘glues together’ the quantifications T and P using the function f(T, P, V) to approximate X.

Figure 5.20 Linking the blocks through their quantifications.

Figure 5.21 Decision tree for selecting an unsupervised method for the shared variable mode case. For abbreviations, see the legend of Table 5.1. For more explanation, see text.

Figure 5.22 Decision tree for selecting an unsupervised method for the shared sample mode case. For abbreviations, see the legend of Table 5.1. For more explanation, see text.

Figure 6.1 ASCA decomposition for two metabolites. The break-up of the original data into factor estimates due to the factors Time and Treatment is shown. (We thank Frans van der Kloet for making these figures.)

Figure 6.2 A part of the ASCA decomposition. Similar to Figure 6.1 but now for 11 metabolites.

Figure 6.3 The ASCA scores on the factor light in the plant example (panel (a); expressed in terms of increasing amount of light) and the corresponding loading for the first ASCA component (panel (b)).

Figure 6.4 The ASCA scores on the factor time in the plant example (panel (a)) and the corresponding loading for the first ASCA component (panel (b)).

Figure 6.5 The ASCA scores on the interaction between light and time in the plant example (panel (a)) and the corresponding loading for the first ASCA component (panel (b)).

Figure 6.6 PCA on toxicology data. Source: Jansen et al. (2008). Reproduced with permission of John Wiley and Sons.

Figure 6.7 ASCA on toxicology data. Component 1: left; component 2: right. Source: Jansen et al. (2008). Reproduced with permission of John Wiley and Sons.

Figure 6.8 PARAFASCA on toxicology data. Component 1: left; component 2: right. The vertical dashed lines indicate the boundary between the early and late stages of the experiment. Source: Jansen et al. (2008). Reproduced with permission of John Wiley and Sons.

Figure 6.9 Permutation example. Panel (a): null-distribution for the first case with an effect (with size indicated with red vertical line). Panel (b): the data of the case with an effect. Panel (c): the null-distribution of the case without an effect and the size (red vertical line). Panel (d): the data of the case with no effect. Source: Vis et al. (2007). Licensed under CC BY 2.0.

Figure 6.10 Permutation test for the factor light (panel (a)) and interaction between light and time (panel (b)). Legend: blue is the null-distribution and effect size is indicated by a red vertical arrow. SSQ is the abbreviation of sum-of-squares.

Figure 6.11 ASCA candy scores from the candy experiment. The plot to the left is based on the ellipses from the residual approach in Friendly et al. (2013). The plot to the right is based on the method suggested in Liland et al. (2018). Source: Liland et al. (2018). Reproduced with permission of John Wiley and Sons.

Figure 6.12 ASCA assessor scores from the candy experiment. The plot to the left is based on the ellipses from the residual approach in Friendly et al. (2013). The plot to the right is based on the method suggested in Liland et al. (2018). Source: Liland et al. (2018). Reproduced with permission of John Wiley and Sons.

Figure 6.13 ASCA assessor and candy loadings from the candy experiment. Source: Liland et al. (2018). Reproduced with permission of John Wiley and Sons.

Figure 6.14 PE-ASCA of the NMR metabolomics of pig brains. Stars in the score plots are the factor estimates and circles are the back-projected individual measurements (Zwanenburg et al., 2011). Source: Alinaghi et al. (2020). Licensed under CC BY 4.0.

Figure 6.15 Tree for selecting an ASCA-based method. For abbreviations, see the legend of Table 6.1; BAL = Balanced data, UNB = Unbalanced data. For more explanation, see text.

Figure 7.1 Conceptual illustration of the handling of common and distinct predictive information for three of the methods covered. The upper figure illustrates that the two input blocks share some information (C1 and C2), but also have substantial distinct components and noise (see Chapter 2), here contained in the X (as the darker blue and darker yellow). The lower three figures show how different methods handle the common information. For MB-PLS, no initial separation is attempted since the data blocks are concatenated before analysis starts. For SO-PLS, the common predictive information is handled as part of the X1 block before the distinct part of the X2 block is modelled. The extra predictive information in X2 corresponds to the additional variability as will be discussed in the SO-PLS section. For PO-PLS, the common information is explicitly separated from the distinct parts before regression.

Figure 7.2 Illustration of the link between the concatenated X blocks and the response, Y, through the MB-PLS super-scores, T.

Figure 7.3 Cross-validated explained variance for various choices of number of components for single- and two-response modelling with MB-PLS.

Figure 7.4 Super-weights (w) for the first and second component from MB-PLS on Raman data predicting the PUFAsample response. Block-splitting indicated by vertical dotted lines.

Figure 7.5 Block-weights (wm) for the first and second component from MB-PLS on Raman data predicting the PUFAsample response. Block-splitting indicated by vertical dotted lines.

Figure 7.6 Block-scores (tm, for the left, middle, and right Raman block, respectively) for the first and second component from MB-PLS on Raman data predicting the PUFAsample response. Colours of the samples indicate the PUFA concentration as % in fat (PUFAfat) and size indicates % in sample (PUFAsample). The two percentages given in each axis label are cross-validated explained variance for PUFAsample weighted by relative block contributions and calibrated explained variance for the block (Xm), respectively.

Figure 7.7 Classification by regression. A dummy matrix (here with three classes, c for class) is constructed according to which group the different objects belong to. Then this dummy matrix is related to the input blocks in the standard way described above.

Figure 7.8 AUROC values of different classification tasks. Source: (Deng et al., 2020). Reproduced with permission from ACS Publications.

Figure 7.9 Super-scores (called global scores here) and block-scores for the sparse MB-PLS model of the piglet metabolomics data. Source: (Karaman et al., 2015). Reproduced with permission from Springer.

Figure 7.10 Linking structure of SO-PLS. Scores for both X1 and the orthogonalised version of X2 are combined in a standard LS regression model with Y as the dependent block.

Figure 7.11 SO-PLS iterates between PLS regression and orthogonalisation, deflating the input block and responses in every cycle. This is illustrated using three input blocks X1, X2, and X3. The upper figure represents the first PLS regression of Y onto X1. Then the residuals from this step, obtained by orthogonalisation, go to the next (figure in the middle) where the same PLS procedure is repeated. The same continues for the last block X3 in the lower part of the figure. In each step, loadings, scores, and weights are available.

Figure 7.12 CVANOVA is used for comparing cross-validated residuals for different prediction methods/models or for different numbers of blocks in the models (for instance in SO-PLS). The squares or the absolute values of the cross-validated prediction residuals, Dik, are compared using a two-way ANOVA model. The figure below the model represents the data set used. The indices i and k denote the two effects: sample and method. The I samples for each method/model (equal to three in the example) are the same, so a standard two-way ANOVA is used. Note that the error variance in the ANOVA model for the three methods is not necessarily the same, so this must be considered a pragmatic approach.

Figure 7.13 Måge plot showing cross-validated explained variance for all combinations of components for the four input blocks (up to six components in total) for the wine data (the digits for each combination correspond to the order A, B, C, D, as described above). The different combinations of components are visualised by four numbers separated by a dot. The panel to the lower right is a magnified view of the most important region (2, 3, and 4 components) for selecting the number of components. Coloured lines show prediction ability (Q2, see cross-validation in Section 2.7.5) for the different input blocks, A, B, C, and D, used independently.

Figure 7.14 PCP plots for wine data. The upper two plots are the score and loading plots for the predicted Y, the other three are the projected input X-variables from the blocks B, C, and D. Block A is not present since it is not needed for prediction. The sizes of the points for the Y scores follow the scale of the ‘overall quality’ (small to large) while colour follows the scale of ‘typical’ (blue, through green to yellow).

Figure 7.15 Måge plot showing cross-validated explained variance for all combinations of components from the three blocks with a maximum of 10 components in total. The three coloured lines indicate pure block models, and the inset is a magnified view around maximum explained variance.

Figure 7.16 Block-wise scores (Tm) with 4+3+3 components for the left, middle, and right block, respectively (two first components for each block shown). Dot sizes show the percentage PUFA in sample (small = 0%, large = 12%), while colour shows the percentage PUFA in fat (see colour-bar on the left).

Figure 7.17 Block-wise (projected) loadings with 4+3+3 components for the left, middle, and right block, respectively (two first for each block shown). Dotted vertical lines indicate the transition between blocks. Note the larger noise level for components six and nine.

Figure 7.18 Block-wise loadings from restricted SO-PLS model with 4+3+3 components for left, middle, and right block, respectively (two first for each block shown). Dotted vertical lines indicate transition between blocks.

Figure 7.19 Måge plot for restricted SO-PLS showing cross-validated explained variance for all combinations of components from the three blocks with a maximum of 10 components in total. The three coloured lines indicate pure block models, and the inset is a magnified view around maximum explained variance.

Figure 7.20 CV-ANOVA results based on the cross-validated SO-PLS models fitted on the Raman data. The circles represent the average absolute values of the difference between measured and predicted response, Dik = |yik − ŷik| (from cross-validation), obtained as new blocks are incorporated. The four ticks on the x-axis represent the different models from the simplest (intercept, predict using average response value) to the most complex containing all the three blocks (‘X left’, ‘X middle’ and ‘X right’). The vertical lines indicate (random) error regions for the models obtained. Overlap of lines means no significant difference according to Tukey’s pair-wise test (Studentised range) obtained from the CV-ANOVA model. This shows that the ‘X middle’ adds significantly to predictive ability, while ‘X right’ has a negligible contribution.

Figure 7.21 Loadings from Principal Components of Predictions applied to the 5+4+0 component solutions of SO-PLS on Raman data.

Figure 7.22 RMSEP for fish data with interactions. The standard SO-PLS procedure is used with the order of blocks described in the text. The three curves correspond to different numbers of components for the interaction part. The symbol * in the original figure (see reference) between the blocks is the same interaction operator as described above. Source: (Næs et al., 2011b). Reproduced with permission from John Wiley and Sons.

Figure 7.23 Regression coefficients for the interactions for the fish data with 4+2+2 components for blocks X1, X2 and the interaction block X3. Regression coefficients are obtained by back-transforming the components in the interaction block to original units in a similar way as shown right after Algorithm 7.3. The full regression vector for the interaction block (with 24 terms, see above) is split into four parts according to the four levels of the two design factors (see description of coding above). Each of the levels of the design factor has its own line in the figure. As can be seen, there are only two lines for each design factor, corresponding to the way the design matrix was handled (see explanation at the beginning of the example). The numbers on the x-axis represent wavelengths in the NIR region. Lines close to 0 are factor combinations which do not contribute to interaction. Source: Næs et al. (2011a). Reproduced with permission from Wiley.

Figure 7.24 SO-PLS results using candy and assessor variables (dummy variables) as X and candy attribute assessments as Y. Component numbers in parentheses indicate how many components were extracted in the other block before the current block.

Figure 7.25 Illustration of the idea behind PO-PLS for three input blocks, to be read from left to right. The first step is data compression of each block separately (giving scores T1, T2 and T3) before a GCA is run to obtain common components. Then each block is orthogonalised (both the Xm and Y) with respect to the common components, and PLS regression is used for each of the blocks separately to obtain block-wise distinct scores. The F in the figure is the orthogonalised Y. The common and block-wise scores are finally combined in a joint regression model. Note that the different T blocks can have different numbers of columns.

Figure 7.26 PO-PLS calibrated/fitted and validated explained variance when applied to three-block Raman with PUFA responses.

Figure 7.27 PO-PLS calibrated explained variance when applied to three-block Raman with PUFA responses.

Figure 7.28 PO-PLS common scores when applied to three-block Raman with PUFA responses. The plot to the left is for the first component from X1,2,3 versus X1,2 and the one to the right is for the first component from X1,2,3 versus X1,3. Size and colour of the points follow the amount of PUFA % in sample and PUFA % in fat, respectively (see also the numbers presented in the text for the axes). The percentages reported in the axis labels are calibrated explained variance for the two responses, corresponding to the numbers in Figure 7.26.

Figure 7.29 PO-PLS common loadings when applied to three-block Raman with PUFA responses.

Figure 7.30 PO-PLS distinct loadings when applied to three-block Raman with PUFA responses.

Figure 7.31 ROSA component selection searches among candidate scores (tm) from all blocks for the one that minimises the distance to the residual response Y. After deflation with the winning score (Ynew = Y − trqrᵗ = Y − trtrᵗY) the process is repeated until a desired number of components has been extracted. Zeros in weights are shown in white for an arbitrary selection of blocks, here blocks 2, 1, 3, 1. Loadings, P, and weights, W (see text), span all blocks.

Figure 7.32 Cross-validated explained variance when ROSA is applied to three-block Raman with PUFA in sample and in fat on the left and both PUFA responses simultaneously on the right.

Figure 7.33 ROSA weights (five first components) when applied to three-block Raman with the PUFAsample response.

Figure 7.34 Summary of cross-validated candidate scores from blocks. Top: residual RMSECV (root mean square error of cross-validation) for each candidate component. Bottom: correlation between candidate scores and the score from the block that was selected. White dots show which block was selected for each component.

Figure 7.35 The decision paths for ‘Common and distinct components’ (implicitly handled, additional contribution from block or explicitly handled) and ‘Choosing components’ (single choice, for each block or more complex) coincide, as do ‘Invariance to block scaling’ (block scaling affects decomposition or not) and ‘# components’ (same number for all blocks or individual choice). When traversing the tree from left or right, we therefore need to follow either a green or a blue path through the ellipsoids, e.g., starting from ‘# components’ leads to choices ‘Different’ or ‘Same’. More in-depth explanations of the concepts are found in the text above.

Figure 8.1 Figures (a)–(c) represent an L-structure/skeleton and Figure (d) a domino structure. See also notation and discussion of skeletons in Chapter 3. The grey background in (b) and (c) indicates that some methods analyse the two modes sequentially. Different topologies, i.e., different ways of linking the blocks, associated with this skeleton will be discussed for each particular method.

Figure 8.2 Conceptual illustration of common information shared by the three blocks. The green colour represents the common column space of X1 and X2 and the red the common row space of X1 and X3. The orange in the upper corner of X1 represents the joint commonness of the two spaces. The blue is the distinct parts of the blocks. This illustration is conceptual; there is no mathematical definition available yet about the commonness between row spaces and column spaces simultaneously.

Figure 8.3 Topologies for four different methods. The three first ((a), (b), (c)) are based on analysing the two modes in sequence. (a) PLS used for both modes (this section). (b) Correlation first approach (Section 8.5.4). (c) Using unlabelled data in calibration (Section 8.5.2). The topology in (d) will be discussed in Section 8.3. We refer to the main text for more detailed descriptions. The dimensions of the blocks are X1 (I × N), X2 (I × J), and X3 (K × N). The topology in (a) corresponds to external preference mapping which will be given main attention here.

Figure 8.4 Scheme for information flow in preference mapping with segmentation of consumers.

Figure 8.5 Preference mapping of dry fermented lamb sausages: (a) sensory PCA scores and loadings (from X2), and (b) consumer loadings presented for four segments determined by cluster analysis. Source: (Helgesen et al., 1997). Reproduced with permission from Elsevier.

Figure 8.6 Results from consumer liking of cheese. Estimated effects of the design factors in Table 8.3. Source: Almli et al. (2011). Reproduced with permission from Elsevier.

Figure 8.7 Results from consumer liking of cheese. (a) Loadings from PCA of the residuals from ANOVA (using consumers as rows). Letters R/P in the loading plot refer to raw/pasteurised milk, and E/S refer to everyday/special occasions. (b) PCA scores from the same analysis with indication of the two consumer segments. Source: Almli et al. (2011). Reproduced with permission from Elsevier.

Figure 8.8 Relations between segments and consumer characteristics. Source: (Almli et al., 2011). Reproduced with permission from Elsevier.

Figure 8.9 Topology for the extension. This is a combination of a regression situation along the horizontal axis and a path model situation along the vertical axis.

Figure 8.10 L-block scheme with weights w’s. The w’s are used for calculating scores for deflation.

Figure 8.11 Endo-L-PLS results for fruit liking study. Source: (Martens et al., 2005). Reproduced with permission from Elsevier.

Figure 8.12 Classification CV-error as a function of the α value and the number of L-PLS components. Source: (Sæbø et al., 2008b). Reproduced with permission from Elsevier.

Figure 8.13 (a) Data structure for labelled and unlabelled data. (b) Flow chart for how to utilise unlabelled data.

Figure 8.14 Tree for selecting methods with complex data structures.

Figure 9.1 General setup for fusing heterogeneous data using representation matrices. The variables in the blocks X1, X2 and X3 are represented with proper I × I representation matrices which are subsequently analysed simultaneously with an IDIOMIX model generating scores and loadings. Source: Smilde et al. (2020). Reproduced with permission of John Wiley and Sons.

Figure 9.2 Score plots of IDIOMIX, OS-SCA and GSCA for the genomics fusion; always score 3 (SC3) on the y-axes and score 1 (SC1) on the x-axes. The third component clearly differs among the methods. Source: Smilde et al. (2020). Licensed under CC BY 4.0.

Figure 9.3 True design used in mixture preparation (blue) versus the columns of the associated factor matrix corresponding to the mixture mode extracted by the BIBFA model (red) and the ACMTF model (red). Source: Acar et al. (2015). Reproduced with permission from IEEE.

Figure 9.4 Cross-validation results for the penalty parameter λbin of the mutation block (left) and for the drug response, transcriptome, and methylation blocks (λquan, right) in the PESCA model. More explanation, see text. Adapted from Song et al. (2019).

Figure 9.5 Explained variances of the PESCA (a) and MOFA (b) model on the CCL data. From top to bottom: drug response, methylation, transcriptome, and mutation data. The values are percentages of explained variation. More explanation, see text. Adapted from Song et al. (2019).

Figure 9.6 From multiblock data to three-way data.

Figure 9.7 Decision tree for selecting an unsupervised method. For abbreviations, see the legend of Table 9.1. The furthest left leaf is empty, but CD methods can also be used in that case. For more explanation, see text.

Figure 10.1 Results from multiblock redundancy analysis of the Wine data, showing Y scores (ur) and block-wise weights for each of the four input blocks (A, B, C, D).

Figure 10.2 Pie chart of the sources of contribution to the total variance (arbitrary sector sizes for illustration).

Figure 10.3 Flow chart for the NI-SL method.

Figure 10.4 An illustration of SO-N-PLS, modelling a response using a two-way matrix, X1, and a three-way array, X2.

Figure 10.5 Path diagram for a wine tasting study. The blocks represent the different stages of a wine tasting experiment and the arrows indicate how the blocks are linked. Source: (Næs et al., 2020). Reproduced with permission from Wiley.

Figure 10.6 Wine data. PCP plots for prediction of block D from blocks A, B, and C. Scores and loadings from PCA on the predicted y-values on top. The loadings from projecting the orthogonalised X-blocks (except the first which is used as is) onto the scores at the bottom. Source: Romano et al. (2019). Reproduced with permission from Wiley & Sons.

Figure 10.7 An illustration of the multigroup setup, where variables are shared among X blocks and related to responses, Y, also sharing their own variables.

Figure 10.8 Decision tree for selecting a supervised method. For more explanation, see text.

Figure 11.1 Output from use of scoreplot() on a pca object.

Figure 11.2 Output from use of loadingplot() on a cca object.

Figure 11.3 Output from use of scoreplot(pot.sca, labels = "names") (SCA scores in 2 dimensions).

Figure 11.4 Output from use of loadingplot(pot.sca, block = "Sensory", labels = "names") (SCA loadings in 2 dimensions).

Figure 11.5 Output from use of plot(can.statis$statis) (STATIS summary plot).

Figure 11.6 Output from use of scoreplot() (ASCA scores in 2 dimensions).

Figure 11.7 Output from use of scoreplot() (ASCA scores in 1 dimension).

Figure 11.8 Output from use of loadingplot() (ASCA loadings in 2 dimensions).

Figure 11.9 Output from use of scoreplot() (block-scores).

Figure 11.10 Output from use of loadingplot() (block-loadings).

Figure 11.11 Output from use of scoreplot() and loadingweightplot() on an object from sMB-PLS.

Figure 11.12 Output from use of maage().

Figure 11.13 Output from use of maageSeq().

Figure 11.14 Output from use of loadingplot() on an sopls object.

Figure 11.15 Output from use of scoreplot() on an sopls object.

Figure 11.16 Output from use of scoreplot() on a pcp object.

Figure 11.17 Output from use of plot() on a cvanova object.

Figure 11.18 Output from use of scoreplot() on a popls object.

Figure 11.19 Output from use of loadingplot() on a popls object.

Figure 11.20 Output from use of loadingplot() on a rosa object.

Figure 11.21 Output from use of scoreplot() on a rosa object.

Figure 11.22 Output from use of image() on a rosa object.

Figure 11.23 Output from use of image() with parameter "residual" on a rosa object.

Figure 11.24 Output from use of scoreplot() on an mbrda object.

Figure 11.25 Output from use of plot() on an lpls object. Correlation loadings from blocks are coloured and overlaid on each other to visualise relations across blocks.
