4.4.3 Combination of Fuzzy Models and SVR
Given observation data from an unknown system, data-driven methods aim to construct a decision function f(x) that can serve as an approximation of the system. As seen from the previous sections, both fuzzy models and SVR are employed to describe the decision function. Fuzzy models characterize the system by a collection of interpretable if-then rules, and a general fuzzy model that consists of a set of rules with the following structure will be used here:
$$R_i:\;\; \text{IF } x_1 \text{ is } A_{i1} \text{ and } \cdots \text{ and } x_d \text{ is } A_{id}, \text{ THEN } y_i = g_i(\mathbf{x}, \beta_i), \qquad i = 1, \ldots, c \tag{4.57}$$
Here, parameter d is the dimension of the antecedent variable vector x = [x1, x2, … , xd]T, Ri is the i-th rule in the rule base, and Ai1, … , Aid are fuzzy sets defined for the respective antecedent variables. The rule consequent gi(x, βi) is a function of the inputs with parameters βi. Parameter c is the number of fuzzy rules. By modification of Eq. (4.41) to product form and Eq. (4.43), the decision function in terms of the fuzzy model by fuzzy mean defuzzification becomes

$$f(\mathbf{x}) = \frac{\sum_{i=1}^{c} \mu_i(\mathbf{x})\, g_i(\mathbf{x}, \beta_i)}{\sum_{i=1}^{c} \mu_i(\mathbf{x})} \tag{4.58}$$

where $\mu_i(\mathbf{x}) = \prod_{j=1}^{d} \mu_{ij}(x_j)$ is the antecedent firing strength, and μij(xj) is the membership of xj in the fuzzy set Aij. By the generalization of Eqs. (4.46) and (4.56), SVR is formulated as minimization of the following functional:

$$\min_{\mathbf{w},\, b}\;\; \frac{1}{2}\,\lVert \mathbf{w}\rVert^{2} + C\sum_{k=1}^{n}\bigl| y_k - f(\mathbf{x}_k) \bigr|_{\varepsilon} \tag{4.59}$$
where C is the regularization parameter and |·|ε denotes the ε-insensitive loss. The solution of (4.59) determines the decision function f(x), which is given by (see Eq. (4.56)):

$$f(\mathbf{x}) = \sum_{i=1}^{c} \alpha_i\, k(\mathbf{x}, \mathbf{x}_i) + b \tag{4.60}$$

where the coefficients αi are subject to the constraints 0 ≤ αi ≤ C, c is the number of support vectors, and the kernel k(x, xi) = 〈Φ(x), Φ(xi)〉 is an inner product of the images in the feature space. The model given by Eq. (4.60) is referred to as the SVR model.
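To make the two decision functions concrete, the short numpy sketch below evaluates a fuzzy model by fuzzy-mean defuzzification with product-form Gaussian firing strengths and constant consequents (in the spirit of Eq. (4.58)), and an SVR-style kernel expansion (in the spirit of Eq. (4.60)). All centers, dispersions, and coefficients are illustrative placeholders, not values from the text.

```python
import numpy as np

def firing_strengths(x, centers, sigmas):
    """Product-form antecedent firing strengths mu_i(x) = prod_j mu_ij(x_j) (Gaussian memberships)."""
    return np.exp(-((x - centers) ** 2 / (2.0 * sigmas ** 2)).sum(axis=1))

def fuzzy_model(x, centers, sigmas, consequents):
    """Fuzzy-mean defuzzification: firing-strength-weighted average of the rule consequents."""
    mu = firing_strengths(x, centers, sigmas)
    return np.dot(mu, consequents) / mu.sum()

def svr_model(x, support_vectors, alphas, b, sigmas):
    """Kernel expansion f(x) = sum_i alpha_i k(x, x_i) + b with a Gaussian kernel."""
    k = np.exp(-((x - support_vectors) ** 2 / (2.0 * sigmas ** 2)).sum(axis=1))
    return np.dot(alphas, k) + b

# toy setting: d = 2 inputs, c = 3 rules / support vectors (all values are placeholders)
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 2))
sigmas = np.full((3, 2), 0.8)
x = np.array([0.3, -0.1])
print(fuzzy_model(x, centers, sigmas, consequents=np.array([1.0, -0.5, 2.0])))
print(svr_model(x, centers, alphas=rng.normal(size=3), b=0.1, sigmas=sigmas))
```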
Motivated by the underlying concept of granularity, both the kernel in Eq. (4.60) and the fuzzy membership functions in Eq. (4.58) are information granules. The kernel is a similarity measure between a support vector and a non-support vector in SVR, and fuzzy membership functions associated with fuzzy sets are essentially linguistic granules, which can be viewed as linked collections of fuzzy variables drawn together by the criterion of similarity. Hence, [97–99] regarded the kernel as the t-norm-based algebraic product of Gaussian membership functions,

$$k(\mathbf{x}, \mathbf{x}_i) = \prod_{j=1}^{d} \exp\!\left( -\frac{(x_j - x_{ij})^{2}}{2\sigma_j^{2}} \right) \tag{4.61}$$

and incorporated SVR in FM. In Eq. (4.61), xi = [xi1, xi2, … , xid]T denotes a support vector in the framework of SVR, while xij is referred to as the center of the j-th Gaussian membership function. Parameter σj is a hyperparameter of the kernel, whereas in fuzzy set theory it represents the dispersion of the Gaussian membership function.
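The granular reading of Eq. (4.61) can be checked numerically: the algebraic-product t-norm of one-dimensional Gaussian membership functions reproduces the multidimensional Gaussian kernel. A minimal sketch, with arbitrary illustrative centers and dispersions:

```python
import numpy as np

def gaussian_membership(xj, cj, sigma_j):
    """One-dimensional Gaussian membership function with center c_j and dispersion sigma_j."""
    return np.exp(-((xj - cj) ** 2) / (2.0 * sigma_j ** 2))

def product_tnorm_kernel(x, xi, sigmas):
    """Kernel built as the algebraic-product t-norm of per-dimension memberships (Eq. (4.61))."""
    return np.prod([gaussian_membership(x[j], xi[j], sigmas[j]) for j in range(len(x))])

def gaussian_kernel(x, xi, sigmas):
    """Equivalent multidimensional Gaussian kernel."""
    return np.exp(-np.sum((x - xi) ** 2 / (2.0 * sigmas ** 2)))

x = np.array([0.2, 1.1, -0.7])
xi = np.array([0.0, 1.0, -1.0])      # support vector = membership-function centers
sigmas = np.array([0.5, 0.8, 1.2])   # kernel hyperparameters = dispersions
print(product_tnorm_kernel(x, xi, sigmas), gaussian_kernel(x, xi, sigmas))  # identical values
```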
Fuzzy model based on SVR: Combining the fuzzy model with SVR, we can build a fuzzy system that exploits the advantages each technique offers, so that the trade-off between interpretability and accuracy can be well balanced. Such a model is developed by extracting support vectors to generate fuzzy rules, so c is equal in both Eqs. (4.58) and (4.60). Sometimes, however, there are too many support vectors, which leads to a redundant and complicated rule base even though the model performance is good. Alternatively, we can reduce the number of support vectors and use the reduced set to generate a transparent fuzzy model, while making the fuzzy model retain the original performance of the SVR model, that is, learn the experience already acquired by SVR. In this way, an experience-oriented learning algorithm is created: a simplification algorithm is employed to obtain reduced-set vectors instead of support vectors for constructing the fuzzy model, and the parameters are adjusted by a hybrid learning mechanism that takes into account the experience of the SVR model on the same training data set. The obtained fuzzy model retains the acceptable performance of the original SVR solution and at the same time possesses high transparency, which enables a good compromise between the interpretability and accuracy of the fuzzy model.
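As a rough illustration of the direct route (one fuzzy rule per support vector), the sketch below fits an RBF ν-SVR with scikit-learn on invented toy data and reads each support vector off as a rule center, with its dual coefficient as the consequent weight. The rule printout is only schematic; the reduced-set construction discussed later in this section is not used here.

```python
import numpy as np
from sklearn.svm import NuSVR

# toy 1-D regression data (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

gamma = 0.5                                   # corresponds to 1 / (2 sigma^2)
svr = NuSVR(nu=0.5, C=10.0, kernel="rbf", gamma=gamma).fit(X, y)

# each support vector becomes the center of one Gaussian fuzzy rule;
# its dual coefficient plays the role of the rule consequent weight
for center, alpha in zip(svr.support_vectors_, svr.dual_coef_.ravel()):
    print(f"IF x is Gaussian(center={center[0]:+.2f}, gamma={gamma}) "
          f"THEN contribute {alpha:+.3f}")
print("number of rules (support vectors):", len(svr.support_vectors_))
```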
Constructing interpretable kernels: Besides Gaussian kernel functions such as Eq. (4.61), there are some other common forms of membership functions:
The triangle membership function:

$$\mu(x) = \max\!\left(0,\; 1 - \frac{|x - b|}{\gamma}\right)$$
The generalized bell-shaped membership function:

$$\mu(x) = \frac{1}{1 + \left(\dfrac{x - b}{a}\right)^{2}}$$
The trapezoidal-shaped membership function:

$$\mu(x) = \min\!\left(1,\; \max\!\left(0,\; \frac{2c - |x - b|}{c}\right)\right)$$
where b is the center of the membership function, and the parameters γ, a, and c are the dispersion parameters of the three membership functions, respectively. Could the above functions also be used for constructing an admissible Mercer kernel? Note that they are translation-invariant functions, so a multidimensional function created from them with the product t-norm operator is also translation invariant. Furthermore, if we regard such multidimensional functions as translation-invariant kernels, then the following theorem can be used to check whether these kernels are admissible Mercer kernels:
A translation-invariant kernel k(x, xi) = k(x − xi) is an admissible Mercer kernel if and only if its Fourier transform

$$F[k](\boldsymbol{\omega}) = (2\pi)^{-d/2} \int_{\mathbb{R}^{d}} e^{-i \langle \boldsymbol{\omega}, \mathbf{x} \rangle}\, k(\mathbf{x})\, d\mathbf{x}$$

is non-negative [100]. For the case of the triangle and generalized bell-shaped membership functions, the Fourier transforms are, respectively,

$$F[k](\omega) = \frac{\gamma}{\sqrt{2\pi}}\, \operatorname{sinc}^{2}\!\left( \frac{\gamma \omega}{2} \right), \qquad \operatorname{sinc}(u) = \frac{\sin u}{u},$$
and

$$F[k](\omega) = \sqrt{\frac{\pi}{2}}\; a\, e^{-a \lvert \omega \rvert}.$$
Since both of them are non-negative, we can construct Mercer kernels with triangle and generalized bell-shaped membership functions. But the Fourier transform in the case of the trapezoidal-shaped membership function is

$$F[k](\omega) = \frac{1}{\sqrt{2\pi}}\, \frac{4 \sin\!\left( \frac{3c\omega}{2} \right) \sin\!\left( \frac{c\omega}{2} \right)}{c\, \omega^{2}},$$
which is not always non-negative. In conclusion, the kernel can also be constructed as a product-type multidimensional triangle or generalized bell-shaped membership function, but not as a trapezoidal-shaped one. The product notation can also be read as a fuzzy logical operator, namely, the t-norm-based algebraic product [101, 102]. The obtained Mercer kernels can thus be understood by means of the conjunction (and) used in the previous sections, and one can assign linguistic meanings to the constructed Mercer kernels to obtain interpretability.
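The Mercer condition can also be probed numerically: sampling a membership function on a grid and taking its discrete Fourier transform should give an (approximately) non-negative spectrum for an admissible shape. The sketch below checks the two compactly supported shapes, the triangle and the trapezoid, under the parameterizations assumed above; the bell shape has unbounded support, so truncation would blur the check and it is left to the analytic transform.

```python
import numpy as np

def spectrum_min(f, half_width=20.0, n=4096):
    """Normalized minimum of the (real) Fourier spectrum of an even, compactly supported function."""
    x = np.linspace(-half_width, half_width, n, endpoint=False)
    fx = f(x)
    # for a real, even sample sequence the DFT is real up to numerical noise
    spec = np.real(np.fft.fft(np.fft.ifftshift(fx)))
    return spec.min() / np.abs(spec).max()

gamma, c = 1.0, 1.0
triangle  = lambda x: np.maximum(0.0, 1.0 - np.abs(x) / gamma)
trapezoid = lambda x: np.clip((2.0 * c - np.abs(x)) / c, 0.0, 1.0)

for name, f in [("triangle", triangle), ("trapezoid", trapezoid)]:
    print(f"{name:9s} min(spectrum)/max(spectrum) = {spectrum_min(f):+.3e}")
# the triangle spectrum stays non-negative (up to floating-point noise),
# while the trapezoid spectrum dips clearly below zero -> not a Mercer kernel
```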
Experience-oriented FM via reduced-set vectors: Given n training data pairs (xk, yk), k = 1, …, n, the goal of experience-oriented FM is to construct a fuzzy model such as Eq. (4.58) that achieves a good trade-off between interpretability and accuracy. The trade-off is examined with two objectives: minimizing the number of fuzzy rules, and maximizing the accuracy, that is, the approximation and generalization performance.
Given the good performance of SVR, it is reasonable to share the successful experience of SVR in FM. So, SVR with Mercer kernels is employed to generate the initial fuzzy model and the available experience on the training data. It is also expected that a reduction in the number of rules could make the resulting rule base more interpretable and transparent. Thus, a simplification algorithm is introduced to generate reduced‐set vectors for simplifying the structure of the initial fuzzy model, and at the same time the parameters of the derived simplified model are adjusted by a hybrid learning algorithm including the linear ridge regression algorithm and the gradient descent method based on a new performance measure. As a start, let us reformulate Eq. (4.60) through a simple equivalent algebra transform to obtain
Eq. (4.62), in which c is the number of support vectors, the parameters of the consequent functions are determined by the αi and b of Eq. (4.60), and Θ′ = (Θ′1, …, Θ′d) denotes the kernel parameters. Obviously, if k(x, xi) is created by a Gaussian, triangle, or generalized bell-shaped membership function, then Eq. (4.62) is consistent with the TS fuzzy inference structure, and exhibits good performance under optimal model selection procedures. However, c, that is, the number of support vectors, usually becomes quite large, so that the fuzzy model suffers from being uninterpretable. To compensate for this drawback, c should be replaced by a smaller c′. Thus, a simplified fuzzy model, Eq. (4.63), is used, where c′ is the number of rules, βi = (θi, θ0i, zi, Θ) are the consequent parameters of the i-th rule, whose consequent is k(x, zi; Θ)θi + θ0i, zi are the reduced-set vectors, and Θ = (Θ1, … , Θd) denotes the dispersion parameters of the membership functions. The consequent parameter b does not remain unchanged anymore; it is replaced by θ0i in order to increase the adjustment ability of each consequent.
Rather than directly extracting support vectors to generate fuzzy rules, the FM problem is solved by learning the parameters in Eq. (4.63) while considering the experience of Eq. (4.62). It is expected that Eq. (4.63) would be able to describe the input–output behavior in the same way as Eq. (4.62). However, the experience acquired heavily depends on the selection of the hyperparameters [103, 104]. Improper selection of these hyperparameters may result in bad performance and bring on useless experience and information. Here, some selection methods are suggested:
1 The regularization parameter C can be given by the following prescription, according to the range of output values of the training data [103]: C = max(|ȳ + 3σy|, |ȳ − 3σy|), where ȳ and σy are the mean and the standard deviation of the training outputs y, and n is the number of training data.
2 ν-SVR is employed instead of ε-SVR, since it is easy to control the number of support vectors by adjusting ν. The value of ν can be determined by an asymptotically optimal procedure; the theoretically optimal value for Gaussian noise is 0.54 [104]. For the kernel parameter Θ′, the k-fold cross-validation method is utilized [104]. A small sketch of this selection procedure is given after this list.
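A minimal sketch of the selection procedure on invented data, assuming scikit-learn's NuSVR and a plain grid search for the kernel width; the prescriptions for C and ν follow the two items above.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.model_selection import GridSearchCV

# toy training data (illustrative only)
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# 1) C from the range of the training outputs (prescription of [103])
C = max(abs(y.mean() + 3 * y.std()), abs(y.mean() - 3 * y.std()))

# 2) nu fixed at the asymptotically optimal value for Gaussian noise [104]
nu = 0.54

# 3) kernel parameter selected by k-fold cross-validation
search = GridSearchCV(NuSVR(nu=nu, C=C, kernel="rbf"),
                      param_grid={"gamma": np.logspace(-2, 1, 8)},
                      cv=5, scoring="neg_mean_squared_error").fit(X, y)
print("C =", round(C, 3), " best gamma =", search.best_params_["gamma"])
```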
Reduced-set vectors: In order to share the experience, we are interested in constructing Eq. (4.63) in such a way that the original Eq. (4.62) is approximated. In the following, k(x, xi) is written as k′(x, xi) to indicate its kernel parameters Θ′ in Eq. (4.62); similarly, in Eq. (4.63), the product of membership functions centered at the reduced-set vector zi with parameters Θ is written as the kernel k(x, zi), according to the kernels constructed in the previous paragraph. With these notations, the deviation |fSVR(x) − fFM(x)| between the two models can be bounded (Eq. (4.64)). Choosing the consequent parameters θ0i appropriately, assuming Θ′ = Θ, and applying the Cauchy–Schwarz inequality to the right-hand side of this bound, the deviation measured in the sup-norm ‖·‖∞ is bounded by a quantity ρ that depends on the reduced-set vectors and the consequent parameters:

$$\lVert f_{\mathrm{SVR}} - f_{\mathrm{FM}} \rVert_{\infty} \le \rho \tag{4.65}$$

A small ρ is therefore required in order to obtain a good approximation.
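The bound above is only a guarantee; as a rough numerical illustration of the reduced-set idea (not the text's exact construction), the sketch below picks c′ < c candidate reduced-set vectors by k-means over the support vectors and refits the expansion coefficients by least squares so that the reduced model tracks the SVR output on the training inputs, then reports the empirical sup-norm gap as a stand-in for ρ.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

gamma = 1.0
svr = NuSVR(nu=0.5, C=10.0, gamma=gamma).fit(X, y)
f_svr = svr.predict(X)
print("support vectors:", len(svr.support_vectors_))

# candidate reduced-set vectors z_i: here simply k-means centers of the support vectors
c_reduced = 5
Z = KMeans(n_clusters=c_reduced, n_init=10, random_state=0).fit(svr.support_vectors_).cluster_centers_

# refit coefficients so the reduced expansion mimics the SVR output on the training data
K = np.exp(-gamma * (X - Z.T) ** 2)          # (n, c_reduced) Gaussian kernel matrix
A = np.hstack([K, np.ones((len(X), 1))])     # last column absorbs the bias term
coef, *_ = np.linalg.lstsq(A, f_svr, rcond=None)
f_reduced = A @ coef
print("empirical sup-norm gap:", np.max(np.abs(f_svr - f_reduced)))
```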
Hybrid learning algorithm: Parameter ρ in Eq. (4.65) is not small enough in many situations. In order to obtain a better approximation, a hybrid learning algorithm including a linear ridge regression algorithm and the gradient descent method can be used to adjust Θ and θ0i according to the experience of Eq. (4.62) [105]. The performance measure is defined as

$$E = \sum_{k=1}^{n} \left[ e_1^{2}(\mathbf{x}_k) + \alpha\, e_2^{2}(\mathbf{x}_k) \right] \tag{4.66}$$

where α is a weighting parameter that defines the relative trade-off between the squared error loss and the experience loss, and e1(xk) = yk − fFM(xk), e2(xk) = fSVR(xk) − fFM(xk). Thus, the error between the desired output and the actual output is characterized by the first term, and the second term measures the difference between the actual output and the experienced output of SVR. Therefore, each epoch of the hybrid learning algorithm is composed of a forward pass and a backward pass, which implement the linear ridge regression algorithm and the gradient descent method in E over the parameters Θ and θ0i. Here, θ0i are identified by linear ridge regression in the forward pass. In addition, it is assumed that the Gaussian membership function is employed, and thus Θj is referred to as σj. Using Eqs. (4.62) and (4.63), and defining
$$\varphi_i(\mathbf{x}_k) = \frac{k(\mathbf{x}_k, \mathbf{z}_i)}{\sum_{l=1}^{c'} k(\mathbf{x}_k, \mathbf{z}_l)}, \qquad i = 1, \ldots, c', \tag{4.67}$$
then, at the minimum point of Eq. (4.66) all derivatives with respect to θ0i should vanish:
$$\frac{\partial E}{\partial \theta_{0m}} = -2 \sum_{k=1}^{n} \left[ e_1(\mathbf{x}_k) + \alpha\, e_2(\mathbf{x}_k) \right] \varphi_m(\mathbf{x}_k) = 0, \qquad m = 1, \ldots, c' \tag{4.68}$$
These conditions can be rewritten in the form of normal equations:
(4.69)
where m = 1, … , c′. This is a standard problem that forms the grounds for linear regression, and the most well‐known formula for estimating θ = [θ01 θ02 ⋯ θ0c′]T uses the ridge regression algorithm:
$$\theta = \left( \Psi^{T} \Psi + \delta I \right)^{-1} \Psi^{T} \mathbf{t} \tag{4.70}$$
where δ is a positive scalar, t is the target vector given by the right-hand side of the normal equations (4.69), and
$$\Psi = \left[ \psi(\mathbf{x}_1), \psi(\mathbf{x}_2), \ldots, \psi(\mathbf{x}_n) \right]^{T} \tag{4.71}$$
where ψ(xk) = [φ1(xk), φ2(xk), ⋯, φc′(xk)]T, k = 1, …, n. In the backward pass, the error rates propagate backward and the σj are updated by the gradient descent method. The derivatives with respect to σj are calculated from
$$\frac{\partial E}{\partial \sigma_j} = \sum_{k=1}^{n} \frac{\partial E}{\partial f_{\mathrm{FM}}(\mathbf{x}_k)}\, \frac{\partial f_{\mathrm{FM}}(\mathbf{x}_k)}{\partial \sigma_j} \tag{4.72}$$
where
$$\frac{\partial E}{\partial f_{\mathrm{FM}}(\mathbf{x}_k)} = -2 \left[ e_1(\mathbf{x}_k) + \alpha\, e_2(\mathbf{x}_k) \right] \tag{4.73}$$
and σj are updated as
$$\sigma_j \leftarrow \sigma_j - \eta\, \frac{\partial E}{\partial \sigma_j} \tag{4.74}$$
where η is the constant step size.
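To tie the forward and backward passes together, here is a compact sketch of one hybrid-learning epoch under simplifying assumptions: the fuzzy model is taken as a normalized Gaussian basis expansion with constant consequents θ0i, the performance measure is the sum of squared e1 and e2 terms weighted by α, the forward pass solves the ridge-regularized least-squares problem for θ0i, and the backward pass takes one gradient step on a shared dispersion σ, using a finite-difference gradient in place of the analytic chain-rule derivatives. This illustrates the mechanism rather than reproducing Eqs. (4.66)–(4.74) exactly.

```python
import numpy as np

def psi(X, Z, sigma):
    """Normalized Gaussian firing strengths: one column per rule, rows sum to 1."""
    K = np.exp(-((X - Z.T) ** 2) / (2.0 * sigma ** 2))   # (n, c') memberships
    return K / K.sum(axis=1, keepdims=True)

def loss(theta0, X, Z, sigma, y, f_svr, alpha):
    """Performance measure: squared-error term plus alpha-weighted experience term."""
    f_fm = psi(X, Z, sigma) @ theta0
    return np.sum((y - f_fm) ** 2 + alpha * (f_svr - f_fm) ** 2)

def hybrid_epoch(X, Z, sigma, y, f_svr, alpha=0.5, delta=1e-3, eta=1e-3):
    # forward pass: ridge-regularized least squares for the consequents theta_0i
    Psi = psi(X, Z, sigma)
    A = (1.0 + alpha) * Psi.T @ Psi + delta * np.eye(Z.shape[0])
    theta0 = np.linalg.solve(A, Psi.T @ (y + alpha * f_svr))
    # backward pass: one gradient-descent step on the shared dispersion sigma
    # (finite-difference gradient stands in for the analytic chain-rule derivative)
    eps = 1e-5
    grad = (loss(theta0, X, Z, sigma + eps, y, f_svr, alpha)
            - loss(theta0, X, Z, sigma - eps, y, f_svr, alpha)) / (2.0 * eps)
    return theta0, sigma - eta * grad

# toy data: targets y and a stand-in for the SVR "experience" f_svr
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)
f_svr = np.sin(X).ravel()                    # pretend this is the trained SVR output
Z = np.linspace(-3, 3, 5).reshape(-1, 1)     # c' = 5 reduced-set vectors
sigma = 1.0
for _ in range(30):
    theta0, sigma = hybrid_epoch(X, Z, sigma, y, f_svr)
print("learned dispersion sigma:", round(sigma, 3))
```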