Congo Basin Hydrology, Climate, and Biogeochemistry
5.2.6. Statistical Analysis and Modeling
The statistical analysis and decomposition of SPEI and TWS into temporal and spatial patterns were based on principal component analysis (PCA, e.g., Jolliffe, 2002). The need to localize hydro-climatic signals is increasing because of the growing number of climate signals around the globe (e.g., Ndehedehe et al., 2017b). This has triggered numerous robust applications of multivariate methods in the spatiotemporal analysis of drought patterns and multi-resolution data (see, e.g., Agutu et al., 2017; Bazrafshan et al., 2014; Ivits et al., 2014; Ndehedehe et al., 2016).

To assess the influence of global climate on the hydrology of the Congo Basin, the support vector machine regression model (SVMR, Vapnik, 1995) was used. The support vector machine algorithm (Cortes & Vapnik, 1995) was extended by Vapnik (1995) to regression using an ε-insensitive loss function. The SVMR concept is based on the computation of a linear regression function in a high-dimensional feature space into which the input data (x_i) are mapped through a non-linear function (e.g., Okwuashi & Ndehedehe, 2017). This mapping is warranted because, most of the time, the relationship between a multidimensional input vector x and the output y is unknown and could be non-linear (e.g., Wauters & Vanhoucke, 2014). After finding a linear hyperplane that fits the multidimensional input vectors to the output values, the SVMR predicts future output values that are contained in a validation set (e.g., Okwuashi & Ndehedehe, 2017; Smola & Schölkopf, 2004; Vapnik, 1995; Wauters & Vanhoucke, 2014). Consider a set of data points X = {(x_i, p_i)}, i = 1, …, n, where x_i is the predictor (input) vector for data point i, p_i is the actual value, and n is the number of data points. The linear SVMR function f(x) takes the form (e.g., Vapnik, 1995):
$$f(x) = \langle w, x \rangle + b \qquad (5.2)$$
The assumed linear parameterization in equation 5.2 resembles a linear regression model because the predicted value, f(x), depends on a slope w and an intercept b. However, the goal of the SVMR is to identify a function f(x) that deviates by at most ε from the target values p_i and has a maximum margin for all training patterns x_i. In other words, the aim is to balance learning the relation between inputs and outputs against maintaining good generalization behavior. As highlighted further in Wauters and Vanhoucke (2014), too much focus on minimizing training errors may lead to overfitting. Hence, a pre-specified penalty value (C) is introduced to create the balance between generalization and good training. That is, C regulates the trade-off between the regularization term (½‖w‖²) and the training accuracy in the formulation below (e.g., Vapnik, 1995; Wauters & Vanhoucke, 2014),
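$$\varsigma = \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} L_{\varepsilon}\bigl(p_i, f(x_i)\bigr) \qquad (5.3)$$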
where the compound risk caused by training errors and model complexity is given as ς. Equation 5.3 provides the estimated values for w and b and comprises the empirical risk, measured by the ε-insensitive loss function L_ε, and the regularization term ½‖w‖², which describes the model complexity (Cortes & Vapnik, 1995; Wauters & Vanhoucke, 2014).

Prior to modeling the response of discharge to climate using the SVMR, a regularization approach in which the SST is compressed through a PCA-based orthogonalization was employed (e.g., Barnett & Preisendorfer, 1987; Bretherton et al., 1992; Ndehedehe et al., 2018b). This resulted in significant modes of SST variability from the respective oceans, which were then used as predictors in the SVMR model. Specifically, a linear SVM regression model was trained to fit the data. The SVMR technique evaluates each run of the experiment using regression, by partitioning the data internally into training, validation, and testing components (i.e., 65% of the total data). The remaining 35% of the observed data were thereafter used for forward prediction based on the hold-out method of cross-validation (e.g., Haley, 2017). The stratified partitioning of the data using this approach ensures that each partition includes a similar number of observations from each group. The predicted and observed discharge were then compared using the Pearson correlation coefficient.
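The workflow described above (a linear function f(x) = ⟨w, x⟩ + b, an ε-insensitive loss, a penalty C, a 65/35 hold-out split, and a Pearson comparison) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several points are assumptions made only for illustration: the data are synthetic stand-ins for SST modes and discharge, the compound risk is written as an unnormalized sum of losses, a simple subgradient descent replaces the quadratic-programming solver normally used for SVMR, the hold-out split is a plain random shuffle rather than a stratified partition, and all function names are hypothetical.

```python
import math
import random


def predict(w, b, x):
    """Linear function f(x) = <w, x> + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def eps_insensitive_loss(p, f, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(p - f) - eps)


def compound_risk(w, b, X, P, C, eps):
    """0.5*||w||^2 plus C times the summed eps-insensitive loss (assumed form)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    emp = sum(eps_insensitive_loss(p, predict(w, b, x), eps) for x, p in zip(X, P))
    return reg + C * emp


def fit_linear_svr(X, P, C=1.0, eps=0.1, lr=0.01, n_iter=1000):
    """Subgradient descent on the compound risk; returns the best iterate seen."""
    w, b = [0.0] * len(X[0]), 0.0
    best = (compound_risk(w, b, X, P, C, eps), list(w), b)
    for t in range(n_iter):
        gw, gb = list(w), 0.0  # gradient of 0.5*||w||^2 is w itself
        for x, p in zip(X, P):
            r = p - predict(w, b, x)
            if abs(r) > eps:  # only points outside the tube contribute
                s = 1.0 if r > 0 else -1.0
                gw = [g - C * s * xj for g, xj in zip(gw, x)]
                gb -= C * s
        step = lr / math.sqrt(t + 1.0)  # diminishing step size
        w = [wj - step * g for wj, g in zip(w, gw)]
        b -= step * gb
        risk = compound_risk(w, b, X, P, C, eps)
        if risk < best[0]:
            best = (risk, list(w), b)
    return best[1], best[2]


def holdout_split(X, P, train_frac=0.65, seed=0):
    """Shuffle once; keep train_frac for fitting and the rest for prediction."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * len(idx)))
    tr, te = idx[:cut], idx[cut:]
    return [X[i] for i in tr], [P[i] for i in tr], [X[i] for i in te], [P[i] for i in te]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic example: two hypothetical "SST mode" predictors and a noisy
# linear "discharge" response (made-up coefficients, not real data).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
P = [1.5 * x[0] - 0.8 * x[1] + 0.3 + rng.gauss(0, 0.1) for x in X]

X_tr, P_tr, X_te, P_te = holdout_split(X, P)           # 65% / 35%
w_fit, b_fit = fit_linear_svr(X_tr, P_tr, C=1.0, eps=0.1)
P_hat = [predict(w_fit, b_fit, x) for x in X_te]        # forward prediction
print("held-out Pearson r:", round(pearson_r(P_te, P_hat), 3))
```

Raising C in this sketch puts more weight on the training errors relative to ½‖w‖², while widening ε lets more points sit inside the tube at zero cost. In practice a library solver would normally be preferred over hand-written subgradient descent, for example scikit-learn's `SVR(kernel="linear")` with its `C` and `epsilon` parameters.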