Multiblock Data Fusion in Statistics and Machine Learning - Tormod Næs
Example 1.3: Chemistry example: Raman spectroscopy data
The data set was first published in a study containing both Raman and near-infrared spectroscopy measurements of emulsions (Afseth et al., 2005). For the Raman data, 1096 Raman shifts, from 1770 cm⁻¹ to 675 cm⁻¹, were recorded for 69 emulsions containing a mixture of proteins, water, and fats (see Figure 1.6). Two reference values are used as responses: polyunsaturated fatty acids (PUFA) as a percentage of total sample weight (0.3–11.5%) and as a percentage of fats in the sample (2.2–61.6%). The reference values have a correlation of R = 0.73, i.e. R² = 0.54, meaning that around half of the variation in PUFA content is due to the variation in total fat content. The aim of the original study was to quantify the PUFA percentages using only spectroscopy, enabling quick, cheap, and non-destructive measurements.
Figure 1.6 Plot of the Raman spectra used in predicting the fat content. The dashed lines show the split of the data set into multiple blocks.
In this book, we will concentrate on the Raman block, as it completely dominated in a previous multiblock data analysis study (Liland et al., 2016), and split it into suitable wavelength regions instead, here at 1350 cm⁻¹ and 1100 cm⁻¹. This is done to explore the predictive power of the different wavelength regions. This data set will be analysed using several of the supervised methods in this book to see what each of them emphasises. In general, we see that the predictive models mostly leverage the variables corresponding to molecular vibrations associated with lipids and degrees of saturation, and that these models can reproduce the reference values with high precision.
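The split of the Raman block into three regions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the book: the shift axis is assumed to be evenly spaced between 1770 cm⁻¹ and 675 cm⁻¹, the spectra are random placeholder numbers rather than the real measurements, the function name `split_blocks` is hypothetical, and assigning each boundary value to the lower-shift block is an arbitrary choice.

```python
import numpy as np

# Assumed shift axis: 1096 evenly spaced Raman shifts from 1770 to 675 cm^-1.
# The real axis may not be evenly spaced.
shifts = np.linspace(1770.0, 675.0, 1096)

# Placeholder spectra (random numbers, NOT the real data):
# 69 emulsions (rows) by 1096 Raman shifts (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(69, shifts.size))


def split_blocks(X, shifts, upper=1350.0, lower=1100.0):
    """Split the columns of X into three blocks by Raman shift.

    Boundary handling is an assumption: a shift exactly equal to a
    boundary value is placed in the block with the lower shifts.
    """
    masks = [
        shifts > upper,                        # above 1350 cm^-1
        (shifts <= upper) & (shifts > lower),  # between 1350 and 1100 cm^-1
        shifts <= lower,                       # 1100 cm^-1 and below
    ]
    return [X[:, mask] for mask in masks]


blocks = split_blocks(X, shifts)
for i, block in enumerate(blocks, start=1):
    print(f"Block {i}: {block.shape[0]} samples x {block.shape[1]} variables")
```

Each block keeps all 69 samples and only the variables in its region, so the three blocks can be passed side by side to a multiblock method.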