Multiblock Data Fusion in Statistics and Machine Learning - Tormod Næs - Page 26

Example 1.1: Metabolomics example: plant science data

This metabolomics example comes from a larger study in plant sciences (Caldana et al., 2011). The goal of the study was to investigate changes in the metabolism and gene expression of Arabidopsis related to growth under different light and temperature conditions. To this end, time-resolved experiments were performed. The design of the data set is shown in Figure 1.3. It is not a fully crossed design, but for each cell in the design, gene-expression and metabolomics measurements were performed at 19 time points. We will use only the metabolomics measurements, which comprised around 65 identified metabolites, and only the part measured at 21 °C (the third line in the table below). This results in four blocks of data (21-D, 21-LL, 21-L and 21-HL), each consisting of 19 rows (time points) and 65 columns (measured metabolites). Hence, we study only the factors light and time (the factor temperature is kept constant).

Figure 1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.

A first impression of the variation in metabolite levels can be obtained by performing a principal component analysis (PCA) on the data (see Figure 1.4(a)), where we have concatenated all four blocks (21-D, 21-LL, 21-L and 21-HL) below each other. The colour coding follows the light conditions, and the figure shows that there is systematic variation in the data associated with the factor light. A more advanced analysis of these data uses a multiblock data analysis method that takes the underlying experimental design into account, such as ANOVA-simultaneous component analysis (ASCA, see Chapter 6). Figure 1.4(b) shows the scores on the first ASCA interaction component, which clearly reveal a time-dependent contrast between dark and high light conditions. The original data set also comprises gene-expression measurements, which makes the problem even more challenging.

Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.
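
The concatenate-then-PCA step described in this example can be illustrated with a minimal sketch. It is not the authors' code: the measurements are replaced by random numbers (an assumption made purely so that the sketch runs), the variable names are hypothetical, and PCA is computed here from a singular value decomposition of the column-centred matrix. Only the block names and the 19 × 65 block dimensions are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: random stand-in data. The real metabolomics measurements
# (Caldana et al., 2011) are not reproduced here.
conditions = ["21-D", "21-LL", "21-L", "21-HL"]
n_time, n_metabolites = 19, 65
blocks = {name: rng.normal(size=(n_time, n_metabolites)) for name in conditions}

# Concatenate the four blocks below each other: (4 * 19) x 65 = 76 x 65.
X = np.vstack([blocks[name] for name in conditions])

# One light-condition label per row, in the same order as the stacking;
# these labels would drive the colour coding in a score plot.
labels = np.repeat(conditions, n_time)

# Column-centre so the components describe variation around the overall mean.
Xc = X - X.mean(axis=0)

# PCA via the singular value decomposition Xc = U S V^T.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
scores = U[:, :n_components] * s[:n_components]   # 76 x 2, one row per sample
loadings = Vt[:n_components].T                    # 65 x 2, one row per metabolite
explained = s**2 / np.sum(s**2)                   # fraction of variance per component

print(scores.shape, loadings.shape)
```

Plotting the first column of `scores` against the second, coloured by `labels`, would give a display of the kind shown in Figure 1.4(a). With the random stand-in data no grouping by light condition should be expected. The sketch centres the columns but does not scale them; whether scaling is appropriate for the real data is not stated in the text.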
