4.4.2 SVR
The basic idea: Let {(x1, y1), …, (xl, yl)} ⊂ X × ℝ be a given training data set, where X denotes the space of the input patterns (e.g., X = ℝd). In ε‐SV regression, the objective is to find a function f(x) that deviates by at most ε from the actually obtained targets yi for all the training data and, at the same time, is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this. We begin by describing the case of linear functions f, taking the form

f(x) = 〈w, x〉 + b,  with w ∈ X, b ∈ ℝ,        (4.44)
where 〈·, ·〉 denotes the dot product in X. Flatness in the case of Eq. (4.44) means that one seeks a small w. One way to ensure this is to minimize the norm, that is, ‖w‖² = 〈w, w〉. We can write this problem as a convex optimization problem:

minimize    (1/2)‖w‖²
subject to  yi − 〈w, xi〉 − b ≤ ε,
            〈w, xi〉 + b − yi ≤ ε.        (4.45)
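As a quick illustration, the following is a minimal sketch of the hard-constraint problem in Eq. (4.45), written with the cvxpy modeling library; the toy data, the value of ε, and all variable names are illustrative assumptions rather than part of the text.

```python
# Minimal sketch of the hard-constraint primal in Eq. (4.45); cvxpy/numpy are
# assumed, and the data and eps are illustrative.
import cvxpy as cp
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # l = 4 patterns, d = 1
y = np.array([0.1, 1.0, 2.1, 2.9])
eps = 0.2

w = cp.Variable(X.shape[1])
b = cp.Variable()

residual = y - X @ w - b                     # y_i - <w, x_i> - b
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),    # flatness: minimize (1/2)||w||^2
    [residual <= eps, residual >= -eps])     # keep every point inside the eps-tube
problem.solve()                              # feasible only if such an eps-tube exists
print(w.value, b.value)
```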
The tacit assumption in Eq. (4.45) was that such a function f actually exists that approximates all pairs (xi, yi) with ε precision, or, in other words, that the convex optimization problem is feasible. Sometimes the problem might not have a solution for the given ε, or we may also want to allow for some errors. In analogy with the "soft margin" (see Figure 2.7 of Chapter 2), one can introduce slack variables ξi, ξi* to cope with the possibly infeasible constraints of the optimization problem of Eq. (4.45). So we have

minimize    (1/2)‖w‖² + C Σi (ξi + ξi*)
subject to  yi − 〈w, xi〉 − b ≤ ε + ξi,
            〈w, xi〉 + b − yi ≤ ε + ξi*,
            ξi, ξi* ≥ 0.        (4.46)
The constant C > 0 determines the trade‐off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with the so‐called ε‐insensitive loss function |ξ|ε described by

|ξ|ε ≔ 0 if |ξ| ≤ ε, and |ξ|ε ≔ |ξ| − ε otherwise.        (4.47)

Figure 4.6 Illustration of the soft margin for a linear support vector machine (SVM).
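For completeness, the soft-margin primal of Eq. (4.46) and the ε‐insensitive loss of Eq. (4.47) can be sketched in the same way; again cvxpy and numpy are assumed, and the data, C, and ε are illustrative choices, not values from the text.

```python
# Sketch of the soft-margin primal (Eq. (4.46)) and the eps-insensitive loss
# (Eq. (4.47)); data and hyperparameters are illustrative.
import cvxpy as cp
import numpy as np

def eps_insensitive_loss(xi, eps):
    """|xi|_eps = 0 if |xi| <= eps, else |xi| - eps  (Eq. (4.47))."""
    return np.maximum(np.abs(xi) - eps, 0.0)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.3, 1.9, 3.2])
C, eps = 10.0, 0.1

l, d = X.shape
w, b = cp.Variable(d), cp.Variable()
xi, xi_star = cp.Variable(l, nonneg=True), cp.Variable(l, nonneg=True)

residual = y - X @ w - b
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star)),
    [residual <= eps + xi,          # y_i - <w, x_i> - b <= eps + xi_i
     -residual <= eps + xi_star])   # <w, x_i> + b - y_i <= eps + xi_i*
problem.solve()
print(eps_insensitive_loss(y - X @ w.value - b.value, eps))
```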
Figure 2.7 of Chapter 2 has been modified to better reflect the definitions introduced in this section and is presented as Figure 4.6. The optimization problem of Eq. (4.46) can often be solved more easily in its dual formulation, which also provides the key to extending the SV machine to nonlinear functions.
Lagrange function: The key idea here is to construct a Lagrange function from the objective function (the primal objective function) and the corresponding constraints, by introducing a dual set of variables [95]. It can be shown that this function has a saddle point with respect to the primal and dual variables at the solution. So we have

L ≔ (1/2)‖w‖² + C Σi (ξi + ξi*) − Σi (ηi ξi + ηi* ξi*)
    − Σi αi (ε + ξi − yi + 〈w, xi〉 + b) − Σi αi* (ε + ξi* + yi − 〈w, xi〉 − b).        (4.48)
Here, L is the Lagrangian, and ηi, ηi*, αi, αi* are Lagrange multipliers. Hence, the dual variables in Eq. (4.48) have to satisfy positivity constraints, that is, αi(*), ηi(*) ≥ 0, where by αi(*) we jointly refer to αi and αi*. At the saddle point, the partial derivatives of L with respect to the primal variables (w, b, ξi, ξi*) have to vanish; that is,

∂L/∂b = Σi (αi* − αi) = 0,
∂L/∂w = w − Σi (αi − αi*) xi = 0,
∂L/∂ξi(*) = C − αi(*) − ηi(*) = 0.

Substituting these conditions into Eq. (4.48) yields the dual optimization problem:

maximize    −(1/2) Σi,j (αi − αi*)(αj − αj*) 〈xi, xj〉 − ε Σi (αi + αi*) + Σi yi (αi − αi*)
subject to  Σi (αi − αi*) = 0 and αi, αi* ∈ [0, C].        (4.49)
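The dual in Eq. (4.49) is a quadratic program over αi, αi*. The following is a hedged sketch of it, again assuming cvxpy/numpy with illustrative data, C, and ε.

```python
# Sketch of the dual QP in Eq. (4.49); data and hyperparameters are illustrative.
import cvxpy as cp
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.3, 1.9, 3.2])
C, eps = 10.0, 0.1
l = len(y)

alpha = cp.Variable(l, nonneg=True)        # alpha_i >= 0
alpha_star = cp.Variable(l, nonneg=True)   # alpha_i^* >= 0
beta = alpha - alpha_star                  # shorthand for alpha_i - alpha_i^*

# With K_ij = <x_i, x_j> we have beta^T K beta = ||X^T beta||^2, since K = X X^T.
objective = (-0.5 * cp.sum_squares(X.T @ beta)
             - eps * cp.sum(alpha + alpha_star)
             + y @ beta)
problem = cp.Problem(cp.Maximize(objective),
                     [cp.sum(beta) == 0, alpha <= C, alpha_star <= C])
problem.solve()
```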
In deriving Eq. (4.49), we have already eliminated the dual variables ηi, ηi* through the condition ∂L/∂ξi(*) = C − αi(*) − ηi(*) = 0, which can be reformulated as ηi(*) = C − αi(*). The condition ∂L/∂w = 0 can be rewritten as w = Σi (αi − αi*) xi, giving

f(x) = Σi (αi − αi*) 〈xi, x〉 + b.        (4.50)
This is the so‐called support vector expansion; that is, w can be completely described as a linear combination of the training patterns xi . In a sense, the complexity of a function’s representation by SVs is independent of the dimensionality of the input space X, and depends only on the number of SVs. Moreover, note that the complete algorithm can be described in terms of dot products between the data. Even when evaluating f(x), we need not compute w explicitly. These observations will come in handy for the formulation of a nonlinear extension.
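Assuming the dual sketch above has been solved, the support vector expansion of Eq. (4.50) can be evaluated through dot products with the support vectors only; the threshold used to identify support vectors is an illustrative choice.

```python
# Support vector expansion of Eq. (4.50), using alpha/alpha_star and X from the
# dual sketch above; tol is an illustrative "nonzero" threshold.
import numpy as np

a = alpha.value - alpha_star.value           # alpha_i - alpha_i^*
tol = 1e-6
sv = np.abs(a) > tol                         # support vectors: nonzero coefficients

w = (a[sv][:, None] * X[sv]).sum(axis=0)     # w = sum_i (alpha_i - alpha_i^*) x_i

def f(x, b):
    # f(x) evaluated via dot products with the SVs only; w is never formed explicitly.
    return sum(a_i * float(x_i @ x) for a_i, x_i in zip(a[sv], X[sv])) + b
```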
Computing b: The parameter b can be computed by exploiting the Karush–Kuhn–Tucker (KKT) conditions, which state that at the solution the product between dual variables and constraints has to vanish, giving αi (ε + ξi − yi + 〈w, xi〉 + b) = 0, αi* (ε + ξi* + yi − 〈w, xi〉 − b) = 0, (C − αi) ξi = 0, and (C − αi*) ξi* = 0. This allows us to draw several useful conclusions:
1. Only samples (xi, yi) with corresponding αi(*) = C lie outside the ε‐insensitive tube.
2. αi αi* = 0; that is, there can never be a pair of dual variables αi, αi* that are both simultaneously nonzero. This allows us to conclude that

ε − yi + 〈w, xi〉 + b ≥ 0 and ξi = 0 if αi < C,        (4.51)
ε − yi + 〈w, xi〉 + b ≤ 0 if αi > 0.        (4.52)
In conjunction with an analogous analysis on αi*, we have for b
max( {−ε + yi − 〈w, xi〉 | αi < C} ∪ {ε + yi − 〈w, xi〉 | αi* > 0} )
  ≤ b ≤ min( {−ε + yi − 〈w, xi〉 | αi > 0} ∪ {ε + yi − 〈w, xi〉 | αi* < C} ).        (4.53)
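Under the same assumptions as before (w and the dual coefficients from the sketches above), b can be picked from the interval defined by Eq. (4.53); the tolerance is again an illustrative choice.

```python
# Sketch of computing b from the bounds in Eq. (4.53); uses w, alpha, alpha_star,
# X, y, C, and eps from the sketches above, with an illustrative tolerance.
import numpy as np

tol = 1e-6
g = y - X @ w                                                  # y_i - <w, x_i>

lower = np.concatenate([g[alpha.value < C - tol] - eps,        # alpha_i < C
                        g[alpha_star.value > tol] + eps])      # alpha_i^* > 0
upper = np.concatenate([g[alpha.value > tol] - eps,            # alpha_i > 0
                        g[alpha_star.value < C - tol] + eps])  # alpha_i^* < C

# Any value in [lower.max(), upper.min()] is consistent with Eq. (4.53).
b = 0.5 * (lower.max() + upper.min())
```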
Kernels: We are interested in making the SV algorithm nonlinear. This could, for instance, be achieved by simply preprocessing the training patterns xi by a map Φ: X → F into some feature space F, as already described in Chapter 2, and then applying the standard SV regression algorithm. Let us have a brief look at the example given in Figure 2.8 of Chapter 2. We had quadratic features in ℝ², with the map Φ: ℝ² → ℝ³ given by Φ(x1, x2) = (x1², √2 x1x2, x2²). It is understood that the subscripts in this case refer to the components of x ∈ ℝ². Training a linear SV machine on the preprocessed features would yield a quadratic function, as indicated in Figure 2.8. Although this approach seems reasonable in the particular example above, it can easily become computationally infeasible for both polynomial features of higher order and higher dimensionality.
Implicit mapping via kernels: Clearly, this explicit approach is not feasible in general, and we have to find a computationally cheaper way. The key observation [96] is that, for the feature map of the above example, we have
〈Φ(x), Φ(x′)〉 = 〈(x1², √2 x1x2, x2²), (x1′², √2 x1′x2′, x2′²)〉 = 〈x, x′〉².        (4.54)
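This identity is easy to verify numerically; the following check assumes numpy, and the sample points are arbitrary illustrative values.

```python
# Numerical check of Eq. (4.54): explicit quadratic features vs. squared dot product.
import numpy as np

def phi(x):
    # Explicit feature map Phi: R^2 -> R^3 from the example above.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
x_prime = np.array([0.5, -1.0])

lhs = phi(x) @ phi(x_prime)        # <Phi(x), Phi(x')> in feature space
rhs = (x @ x_prime) ** 2           # k(x, x') = <x, x'>^2, no explicit map needed
assert np.isclose(lhs, rhs)
```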
As noted in the previous section, the SV algorithm only depends on the dot products between patterns xi . Hence, it suffices to know k(x, x′) ≔ 〈Φ(x), Φ(x′)〉 rather than Φ explicitly, which allows us to restate the SV optimization problem:
maximize    −(1/2) Σi,j (αi − αi*)(αj − αj*) k(xi, xj) − ε Σi (αi + αi*) + Σi yi (αi − αi*)
subject to  Σi (αi − αi*) = 0 and αi, αi* ∈ [0, C].        (4.55)
Now the expansion for f in Eq. (4.50) may be written as w = Σi (αi − αi*) Φ(xi) and f(x) = Σi (αi − αi*) k(xi, x) + b.
The difference from the linear case is that w is no longer given explicitly. Also, note that in the nonlinear setting the optimization problem corresponds to finding the flattest function in the feature space, not in the input space.
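As a closing usage sketch, scikit-learn's SVR solves a kernel dual of the form of Eq. (4.55), and its predictions can be reproduced from the stored support vectors, dual coefficients (αi − αi*), and intercept via the kernel expansion above; the data and hyperparameter values here are illustrative choices.

```python
# Usage sketch: reproduce scikit-learn SVR predictions with the expansion
# f(x) = sum_i (alpha_i - alpha_i^*) k(x_i, x) + b. Data and hyperparameters
# are illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)

gamma, C, eps = 0.5, 10.0, 0.1
model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps).fit(X, y)

def rbf(a, b):
    # k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs of rows.
    return np.exp(-gamma * np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))

X_test = np.linspace(-3.0, 3.0, 7).reshape(-1, 1)
K = rbf(model.support_vectors_, X_test)             # k(x_i, x) for the support vectors
f_manual = model.dual_coef_ @ K + model.intercept_  # (alpha_i - alpha_i^*) expansion plus b
assert np.allclose(f_manual.ravel(), model.predict(X_test))
```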