2.2.3 Dimensionality Reduction Techniques

In this section, we provide an overview of the mathematical properties and foundations of the various dimensionality reduction techniques [22–24].

There are several dimensionality reduction techniques specifically designed for time series. These methods exploit the frequency content of the signal and its usual sparseness in the frequency domain. The most popular methods are those based on wavelets [25, 26], and a distant second is empirical mode decomposition [27, 28] (the reader is referred to the references above for further details). We do not cover these techniques here since they are not usually applied for the general‐purpose dimensionality reduction of data. From a general point of view, we may say that wavelets project the input time series onto a fixed dictionary (see Section 2.3). This dictionary has the property of making the projection sparse (only a few coefficients are sufficiently large), and the dimensionality reduction is obtained by setting most coefficients (the small ones) to zero. Empirical mode decomposition instead constructs a dictionary specially adapted to each input signal.
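
As a rough illustration of the wavelet-based approach just described, the following sketch keeps only the largest wavelet coefficients of a time series and reconstructs an approximation from them. It assumes the PyWavelets package (pywt) is available; the test signal, the 'db4' wavelet, and the 90% threshold are arbitrary choices made for this example, not prescriptions from the book.

    import numpy as np
    import pywt

    # Example time series: a noisy sum of two sinusoids (arbitrary test signal).
    t = np.linspace(0, 1, 1024)
    x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t) \
        + 0.1 * np.random.randn(t.size)

    # Project onto a fixed wavelet dictionary (Daubechies-4, 4 decomposition levels).
    coeffs = pywt.wavedec(x, 'db4', level=4)

    # Dimensionality reduction: keep only the 10% largest coefficients, zero the rest.
    flat = np.concatenate(coeffs)
    threshold = np.quantile(np.abs(flat), 0.90)
    coeffs_thr = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]

    # Recover an approximation of the original signal from the sparse representation.
    x_hat = pywt.waverec(coeffs_thr, 'db4')
    print("kept coefficients:", int(np.sum(np.abs(flat) >= threshold)), "of", flat.size)
    print("relative error:", np.linalg.norm(x_hat[:x.size] - x) / np.linalg.norm(x))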

To maintain the consistency of this review, we do not cover those dimensionality reduction techniques that take into account the class of observations; that is, there are observations from a class A of objects, observations from a class B, … and the dimensionality reduction technique should maintain, to the extent possible, the separability of the original classes. Fisher's Linear Discriminant Analysis was one of the first techniques to address this issue [29, 30]. Many other works have followed since then; for the most recent works and for a bibliographical review, see [31, 35]. Next, we will focus on vector quantization, mixture models, and PCA, which was already introduced to some extent in the previous section.

In the following, we will refer to the observations as input vectors x, whose dimension is M. We will assume that we have N observations, and we will refer to the nth observation as xn. The whole dataset of observations will be X, whereas X will be an M × N matrix with all the observations as columns. Note that non‐bold small letters represent vectors (x), whereas capital, non‐bold letters (X) represent matrices.

The goal of the dimensionality reduction is to find another representation χ of a smaller dimension m such that as much information as possible is retained from the original set of observations. This involves some transformation operator from the original vectors onto the new vectors, χ = T(x). These projected vectors are sometimes called feature vectors, and the projection of xn will be denoted as χn. There might not be an inverse for this projection, but there must be a way of recovering an approximate value of the original vector, x̂, such that x̂ ≈ x.

An interesting property of any dimensionality reduction technique is to consider its stability. In this context, a technique is said to be ε‐stable if for any two input data points, x1 and x2, the following inequality holds [36]: (1 − ε)‖x1 − x2‖² ≤ ‖χ1 − χ2‖² ≤ (1 + ε)‖x1 − x2‖². Intuitively, this equation implies that Euclidean distances in the original input space are relatively conserved in the output feature space.
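
To make the definition concrete, the following sketch (not from the book) empirically estimates the smallest ε for which a simple linear reduction χ = Wᵗx, here a random Gaussian projection, is ε‐stable on a sample of points; the dimensions and the sample size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    M, m, N = 100, 20, 200                      # original dim, reduced dim, sample size
    X = rng.standard_normal((M, N))             # toy data, one observation per column

    # Linear reduction chi = W^t x with a random Gaussian W, scaled so that
    # squared distances are preserved on average (Johnson-Lindenstrauss style).
    W = rng.standard_normal((M, m)) / np.sqrt(m)
    Chi = W.T @ X

    # Squared Euclidean distances between all pairs, before and after the reduction.
    d_in = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    d_out = ((Chi[:, :, None] - Chi[:, None, :]) ** 2).sum(axis=0)
    mask = ~np.eye(N, dtype=bool)
    ratios = d_out[mask] / d_in[mask]

    # Smallest epsilon with (1 - eps) d_in <= d_out <= (1 + eps) d_in for all pairs.
    eps = max(1.0 - ratios.min(), ratios.max() - 1.0)
    print("empirical epsilon on this sample:", eps)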

Methods based on statistics and information theory: This family of methods reduces the input data according to some statistical or information theory criterion. In a sense, the methods based on information theory can be seen as a generalization of those based on statistics: they can capture nonlinear relationships between variables, handle interval and categorical variables at the same time, and many of them are invariant to monotonic transformations of the input variables.

Vector quantization and mixture models: Probably the simplest way of reducing dimensionality is by assigning a class (among a total of K classes) to each one of the observations xn. This can be seen as an extreme case of dimensionality reduction in which we go from M dimensions to 1 (the discrete class label χ). Each class, χ, has a representative, which is the average of all the observations assigned to that class. If a vector xn has been assigned to the χn‐th class, then its approximation after the dimensionality reduction is simply the representative of that class, x̂n = xχn (see Figure 2.19).

The goal is thus to find the representatives xχ and the class assignments uχ(x) (uχ(x) is equal to 1 if the observation x is assigned to the χ‐th class, and is 0 otherwise) such that JVQ = Σχ Σn uχ(xn)‖xn − xχ‖² is minimized. This problem is known as vector quantization or k‐means, already briefly introduced in Section 2.1. The optimization of this goal function is a combinatorial problem, although there are heuristics to cut down its cost [37, 38]. An alternative formulation of the k‐means objective function is Jk‐means = ‖X − WU‖F² subject to UᵗU = I and uij ∈ {0, 1} (i.e., each input vector is assigned to one and only one class). In this expression, W is an M × m matrix with all representatives as column vectors, U is an m × N matrix whose ij‐th entry is 1 if the j‐th input vector is assigned to the i‐th class, and ‖·‖F denotes the Frobenius norm of a matrix. This intuitive goal function can be put in a probabilistic framework. Let us assume we have a generative model of how the data is produced. Let us assume that the observed data are noisy versions of K vectors xχ, which are equally likely a priori. Let us assume that the observation noise is normally distributed with a spherical covariance matrix Σ = σ²I. The likelihood of observing xn having produced xχ is

p(xn | xχ) = (2πσ²)^(−M/2) exp(−‖xn − xχ‖² / (2σ²))

Figure 2.19 Black circles represent the input data, xn; gray squares represent the class representatives, xχ.


With our previous definition of uχ(x), we can express it as

p(xn) = Πχ [(2πσ²)^(−M/2) exp(−‖xn − xχ‖² / (2σ²))]^uχ(xn)

The log likelihood of observing the whole dataset xn (n = 1, 2, …, N), after removing all constants, is L = −(1/(2σ²)) Σn Σχ uχ(xn)‖xn − xχ‖², which is proportional to −JVQ. We thus see that the goal function of vector quantization, JVQ, produces the maximum likelihood estimates of the underlying xχ vectors.
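
As a minimal numerical counterpart, the following plain-numpy sketch implements the standard Lloyd-type iteration that minimizes JVQ (an illustration, not code from the book; the toy data, K = 3, and the iteration count are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(1)
    M, K = 2, 3
    centers = np.array([[0.0, 4.0, 0.0],
                        [0.0, 0.0, 4.0]])              # true class centers, one per column
    X = np.repeat(centers, 100, axis=1) + 0.5 * rng.standard_normal((M, 3 * 100))
    N = X.shape[1]

    # Initialize the representatives x_chi with randomly chosen observations.
    reps = X[:, rng.choice(N, size=K, replace=False)].copy()

    for _ in range(50):
        # Assignment step: u_chi(x_n) = 1 for the closest representative, 0 otherwise.
        d2 = ((X[:, None, :] - reps[:, :, None]) ** 2).sum(axis=0)   # shape (K, N)
        labels = d2.argmin(axis=0)
        # Update step: each representative becomes the mean of its assigned observations.
        for k in range(K):
            if np.any(labels == k):
                reps[:, k] = X[:, labels == k].mean(axis=1)

    J_VQ = ((X - reps[:, labels]) ** 2).sum()
    print("J_VQ =", J_VQ)
    print("representatives:\n", reps)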

Under this generative model, the probability density function of the observations is the convolution of a Gaussian function and a set of delta functions located at the xχ vectors, that is, a set of Gaussians located at the xχ vectors. The vector quantization then is an attempt to find the centers of the Gaussians forming the probability density function of the input data. This idea has been further pursued by Mixture Models, which are a generalization of vector quantization in which, instead of looking only for the means of the Gaussians associated with each class, we also allow each class to have a different covariance matrix Σχ and a different a priori probability πχ. The algorithm looks for estimates of all these parameters by Expectation–Maximization, and at the end produces, for each input observation xn, the label χ of the Gaussian that has the maximum likelihood of having generated that observation.
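
For comparison, here is a brief sketch of this mixture-model alternative using the EM implementation in scikit-learn (assuming scikit-learn is installed; the toy data and K = 3 are again arbitrary). Note that GaussianMixture expects one observation per row, hence the transpose of X.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    centers = np.array([[0.0, 4.0, 0.0],
                        [0.0, 0.0, 4.0]])              # M = 2, K = 3 true classes
    X = np.repeat(centers, 100, axis=1) + 0.5 * rng.standard_normal((2, 300))

    # EM estimates the means, covariance matrices and a priori probabilities pi_chi.
    gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
    labels = gmm.fit_predict(X.T)          # chi_n: most likely Gaussian for each x_n

    print("estimated means:\n", gmm.means_)
    print("a priori probabilities:", gmm.weights_)
    print("first labels:", labels[:10])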

This concept can be extended and, instead of making a hard class assignment, a fuzzy class assignment can be used by allowing 0 ≤ uχ(x) ≤ 1 and requiring Σχ uχ(x) = 1 for all x. This is another vector quantization algorithm called fuzzy k‐means. The k‐means algorithm is based on a quadratic objective function, which is known to be strongly affected by outliers. This drawback can be alleviated by taking the l1 norm of the approximation errors and modifying the problem to JK‐medians = ‖X − WU‖1 subject to UᵗU = I and uij ∈ {0, 1}. A different approach can be used to find data representatives less affected by outliers, which we may call robust vector quantization, JRVQ = Σχ Σn uχ(xn) Φ(‖xn − xχ‖²), where Φ(x) is a function less sensitive to outliers than Φ(x) = x, for instance, Φ(x) = x^α with α about 0.5.
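
A compact sketch of the fuzzy variant follows (a standard fuzzy c-means iteration written in plain numpy as an illustration, not an excerpt from the book; the fuzzifier p = 2, the toy data, and the small constant added to avoid division by zero are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(3)
    M, N, K, p = 2, 300, 3, 2.0                        # p > 1 is the fuzzifier
    centers = np.array([[0.0, 4.0, 0.0],
                        [0.0, 0.0, 4.0]])
    X = np.repeat(centers, 100, axis=1) + 0.5 * rng.standard_normal((M, N))

    U = rng.random((K, N))
    U /= U.sum(axis=0)                                 # memberships sum to 1 for every x

    for _ in range(100):
        # Update representatives as membership-weighted means.
        W = (U ** p) @ X.T / (U ** p).sum(axis=1, keepdims=True)          # shape (K, M)
        # Update memberships: u_chi(x) proportional to d_chi^(-2/(p-1)).
        d2 = ((X.T[None, :, :] - W[:, None, :]) ** 2).sum(axis=2) + 1e-12  # shape (K, N)
        U = d2 ** (-1.0 / (p - 1.0))
        U /= U.sum(axis=0)

    labels = U.argmax(axis=0)                          # hard labels, if needed
    print("fuzzy memberships of the first point:", U[:, 0])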

Principal component analysis (PCA): Introduced in Section 2.1, PCA is by far one of the most popular algorithms for dimensionality reduction [39–42]. Given a set of observations x with dimension M (they lie in ℝM), PCA is the standard technique for finding the single best (in the sense of least‐squares error) subspace of a given dimension, m. Without loss of generality, we may assume that the data is zero‐mean and that the subspace to fit is a linear subspace (passing through the origin).

This algorithm is based on the search for orthogonal directions explaining as much variance of the data as possible. In terms of dimensionality reduction, it can be formulated [43] as the problem of finding the m orthonormal directions wi minimizing the representation error JPCA = Σn ‖xn − Σi 〈wi, xn〉wi‖². In this objective function, the reduced vectors are the projections χ = (〈w1, x〉, …, 〈wm, x〉)ᵗ. This can be written much more compactly as χ = Wᵗx, where W is an M × m matrix whose columns are the orthonormal directions wi (or, equivalently, WᵗW = I). The approximation to the original vectors is given by x̂ = Σi 〈wi, x〉wi or, equivalently, x̂ = Wχ.

In Figure 2.15, we have shown a graphical representation of a PCA transformation in only two dimensions (x ∈ ℝ²) with a slightly different notation (χ represented by z). As can be seen from Figure 2.15, the variance of the data in the original data space is best captured in the rotated space given by the vectors χ = Wᵗx. χ1 is the first principal component, and it goes in the direction of most variance; χ2 is the second principal component, which is orthogonal to the first direction and goes in the second direction with the most variance (in ℝ² there is not much choice, but in the general case, ℝM, there is). Observe that, without loss of generality, the data is centered about the origin of the output space. We can rewrite the objective function as

JPCA = Σn ‖xn − WWᵗxn‖² = ‖X − WWᵗX‖F²

Note that the class membership matrix (U in vector quantization) has been substituted in this case by WᵗX, which in general can take any positive or negative value. It has thus lost its membership meaning and simply constitutes the weights of the linear combination of the column vectors of W that best approximates each input x. Finally, the PCA objective function can also be written as

JPCA = Tr{WᵗΣXW} [44], where ΣX = (1/N)XXᵗ is the covariance matrix of the observed data. The PCA formulation has also been extended to complex‐valued input vectors [45]; the method is called non‐circular PCA.
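
The formulas above translate almost line by line into code. The following plain-numpy sketch (an illustration, not code from the book; the toy data and m = 1 are arbitrary) computes W from the eigenvectors of ΣX, forms the reduced vectors χ = Wᵗx, and evaluates the representation error.

    import numpy as np

    rng = np.random.default_rng(4)
    M, N, m = 3, 500, 1
    A = rng.standard_normal((M, M))
    X = A @ rng.standard_normal((M, N))        # correlated toy data, one column per observation
    X -= X.mean(axis=1, keepdims=True)         # PCA assumes zero-mean data

    # Covariance matrix Sigma_X = (1/N) X X^t and its eigendecomposition.
    Sigma = X @ X.T / N
    eigval, eigvec = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :m]                 # m orthonormal directions of largest variance

    Chi = W.T @ X                              # reduced representation, chi = W^t x
    X_hat = W @ Chi                            # approximation of the original vectors

    J_PCA = np.sum((X - X_hat) ** 2)
    print("representation error J_PCA:", J_PCA)
    print("fraction of variance captured:", np.trace(W.T @ Sigma @ W) / np.trace(Sigma))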

The matrix projection of the input vectors onto a lower‐dimensional space (χ = Wᵗx) is a widespread technique in dimensionality reduction. As an illustration, let us look at the following example [46]:
