Matrix and Tensor Decompositions in Signal Processing - Gérard Favier
I.5. With what cost functions and optimization algorithms?
We will now briefly describe the most common processing operations carried out with tensors, as well as some of the optimization algorithms used. It is important to first present the preprocessing operations that need to be performed. Preprocessing typically involves data centering (offset elimination), scaling of non-homogeneous data, suppression of outliers and artifacts, image adjustment (size, brightness, contrast, alignment, etc.), denoising, signal transformation using certain transforms (wavelets, Fourier, etc.) and, in some cases, the computation of statistics of the signals to be processed.
Preprocessing is fundamental, both to improve the quality of the estimated models and, therefore, of the subsequent processing operations, and to avoid numerical problems with optimization algorithms, such as conditioning problems that may cause the algorithms to fail to converge. Centering and scaling preprocessing operations are potentially problematic because they are interdependent and can be combined in several different ways. If data are missing, centering can also reduce the rank of the tensor model. For a more detailed description of these preprocessing operations, see Smilde et al. (2004).
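As an illustration of the centering and scaling operations mentioned above, the following sketch (assuming NumPy; the function name and the choice of modes are illustrative, not a convention prescribed by Smilde et al. (2004)) centers a third-order data tensor across one mode and scales it within another:

```python
import numpy as np

def center_and_scale(X, center_mode=0, scale_mode=1):
    """Illustrative preprocessing: center across one mode, scale within another.

    Centering across `center_mode` removes the mean taken over that mode
    (offset elimination); scaling within `scale_mode` gives each slice of
    that mode a unit root-mean-square value (non-homogeneous data).
    """
    # Centering: subtract the mean over `center_mode`.
    Xc = X - X.mean(axis=center_mode, keepdims=True)
    # Scaling: RMS value of each slice of `scale_mode`, computed over all
    # remaining modes, then used to normalize that slice.
    other = tuple(m for m in range(X.ndim) if m != scale_mode)
    norms = np.sqrt((Xc ** 2).mean(axis=other, keepdims=True))
    return Xc / np.where(norms > 0, norms, 1.0)

# Example: a third-order tensor with an offset and heterogeneous scales per
# mode-2 slice (scales spanning five orders of magnitude).
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1e3, 1e4])[None, :, None]
X = 5.0 + rng.standard_normal((4, 5, 6)) * scales
Xp = center_and_scale(X)
```

After this preprocessing, the mean over the centered mode is zero and each slice of the scaled mode has unit RMS value, which mitigates the conditioning problems mentioned above.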
For the processing operations themselves, we can distinguish between several different classes:
– supervised/unsupervised (blind or semi-blind), i.e. with or without training data, as, for example, in classification problems, or when a priori information, called a pilot sequence, is transmitted to the receiver for channel estimation;
– real-time (online)/batch (offline) processing;
– centralized/distributed;
– adaptive/blockwise (with respect to the data);
– with/without coupling of tensor and/or matrix models;
– with/without missing data.
It is important to distinguish batch processing, which is performed to analyze data recorded as signal and image sets, from the real-time processing required by wireless communication systems, recommendation systems, web searches and social networks. In real-time applications, the dimensionality of the model and the algorithmic complexity are predominant factors. The signals received by receiving antennas, the information exchanged between a website and its users and the messages exchanged between the users of a social network are all time-dependent. For instance, a recommendation system interacts with its users in real time, possibly extending an existing database by means of machine learning techniques. For a description of various applications of tensors for data mining and machine learning, see Anandkumar et al. (2014) and Sidiropoulos et al. (2017).
Tensor-based processing leads to various types of optimization algorithms:
– constrained/unconstrained optimization;
– iterative/non-iterative, or closed-form;
– alternating/global;
– sequential/parallel.
Furthermore, depending on the information that is available a priori, different types of constraints can be taken into account in the cost function to be optimized: low rank, sparseness, non-negativity, orthogonality and differentiability/smoothness. In the case of constrained optimization, weights need to be chosen in the cost function according to the relative importance of each constraint and the quality of the a priori information that is available.
Table I.4 presents a few examples of cost functions that can be minimized for the parameter estimation of certain third-order tensor models (CPD, Tucker, coupled matrix Tucker (CMTucker) and coupled sparse tensor factorization (CSTF)), for the imputation of missing data in a tensor and for the estimation of a sparse data tensor with a low-rank constraint expressed in the form of the nuclear norm of the tensor.
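As a concrete illustration of such a quadratic data-fitting criterion, the following minimal sketch (assuming NumPy; all names are illustrative) evaluates the CPD criterion ||χ − ⟦A, B, C⟧||²F for a third-order tensor:

```python
import numpy as np

def cpd_reconstruct(A, B, C):
    """Rank-R CPD reconstruction: sum_r a_r o b_r o c_r (third-order case)."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def cpd_cost(X, A, B, C):
    """Quadratic data-fitting criterion ||X - [[A, B, C]]||_F^2."""
    return np.linalg.norm(X - cpd_reconstruct(A, B, C)) ** 2

# An exact rank-2 tensor is fitted perfectly by its own factors,
# so the criterion vanishes at the true parameters.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
X = cpd_reconstruct(A, B, C)
```

Note that the criterion is quadratic in the residual tensor but trilinear in the unknown factors (A, B, C), which is why alternating optimization methods are natural here.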
REMARK I.1.– We can make the following remarks:
– the cost functions presented in Table I.4 correspond to data fitting criteria. These criteria, expressed in terms of tensor and matrix Frobenius norms (||.||F), are quadratic in the difference between the data tensor χ and the output of CPD and TD models, as well as between the data matrix Y and a matrix factorization model, in the case of the CMTucker model. They are trilinear and quadrilinear, respectively, with respect to the parameters of the CPD and TD models to be estimated, and bilinear with respect to the parameters of the matrix factorization model;
– for the missing data imputation problem using a CPD or TD model, the binary tensor 𝒲, which has the same size as χ, is defined as:

[I.2]  𝒲(i, j, k) = 1 if χ(i, j, k) is known; 𝒲(i, j, k) = 0 if χ(i, j, k) is missing.
The purpose of the Hadamard product (denoted ⊙) of the binary tensor defined in [I.2] with the difference between χ and the output of the CPD or TD model is to fit the model to the available data only, ignoring the missing entries during model estimation. This imputation problem, known as the tensor completion problem, was originally dealt with by Tomasi and Bro (2005) and Acar et al. (2011a) using a CPD model, and later by Filipovic and Jukic (2015) using a TD model. Various articles have discussed this problem in the context of different applications. An overview of the literature will be given in the next volume;
Table I.4. Cost functions for model estimation and recovery of missing data
Problems | Cost functions
Estimation | Cost functions
CPD | min over (A, B, C) of ||χ − ⟦A, B, C⟧||²F
TD | min over (𝒢, A, B, C) of ||χ − 𝒢 ×₁ A ×₂ B ×₃ C||²F
CMTucker | min over (𝒢, A, B, C, V) of ||χ − 𝒢 ×₁ A ×₂ B ×₃ C||²F + ||Y − A Vᵀ||²F
CSTF | min over (𝒢, W, H, S) of ||χ − 𝒢 ×₁ W∗ ×₂ H∗ ×₃ S∗||²F + λ ||𝒢||₁
Imputation | Cost functions
CPD | min over (A, B, C) of ||𝒲 ⊙ (χ − ⟦A, B, C⟧)||²F
TD | min over (𝒢, A, B, C) of ||𝒲 ⊙ (χ − 𝒢 ×₁ A ×₂ B ×₃ C)||²F
Imputation with low-rank constraint | Cost functions
CPD | min over (χ, A, B, C) of ||𝒲 ⊙ (χ − ⟦A, B, C⟧)||²F + λ ||χ||∗
TD | min over (χ, 𝒢, A, B, C) of ||𝒲 ⊙ (χ − 𝒢 ×₁ A ×₂ B ×₃ C)||²F + λ ||χ||∗
– for the imputation problem with the low-rank constraint, the term λ ||χ||∗ in the cost function replaces the low-rank constraint with the nuclear norm of χ, since the function rank(χ) is not convex, and the nuclear norm is the closest convex approximation of the rank. In Liu et al. (2013), this term is replaced by the weighted sum ∑n αn ||Xn||∗, where Xn represents the mode-n unfolding of χ7;
– in the case of the CMTucker model, the coupling considered here relates to the first modes of the tensor χ and the matrix Y of data via the common matrix factor A.
Coupled matrix and tensor factorization (CMTF) models were introduced in Acar et al. (2011b) by coupling a CPD model with a matrix factorization and using the gradient descent algorithm to estimate the parameters. This type of model was used by Acar et al. (2017) to merge EEG and fMRI data with the goal of analyzing brain activity. The EEG signals are modeled with a normalized CPD model (see Chapter 5), whereas the fMRI data are modeled with a matrix factorization. The data are coupled through the subjects mode (see Table I.1). The cost function to be minimized is therefore given by:
[I.3]  J(g, A, B, C, σ, V) = ||χ − ∑r gr (ar ∘ br ∘ cr)||²F + ||Y − A Σ Vᵀ||²F + α ||g||₁ + α ||σ||₁,
in which ar, br and cr denote the columns of A, B and C, and V is the second factor of the matrix factorization of Y,
where the column vectors of the matrix factors (A, B, C) have unit norm, Σ is a diagonal matrix whose diagonal elements are the coefficients of the vector σ, and α > 0 is a penalty parameter that increases or decreases the importance of the sparseness constraints on the weight vectors (g, σ), modeled by means of the l1 norm. The advantage of merging EEG and fMRI data with the criterion [I.3] is that the two acquisition modalities are complementary in terms of resolution: EEG signals have a high temporal resolution but a low spatial resolution, while fMRI imaging provides a high spatial resolution but a low temporal resolution;
– in the case of the CSTF model (Li et al. 2018), the tensor of high-resolution hyperspectral images (HR-HSI) is represented using a third-order Tucker model that has a sparse core, with the following modes: space (width) × space (height) × spectral bands. The matrices W, H and S denote the dictionaries for the width, height and spectral modes, composed of nw, nh and ns atoms, respectively, and the core tensor 𝒢 contains the coefficients relative to the three dictionaries. The matrices W∗, H∗ and S∗ are spatially and spectrally subsampled versions with respect to each mode. The term λ is a regularization parameter for the sparseness constraint on the core tensor, expressed in terms of the l1 norm of 𝒢.
7 See definition [3.41] of the unfolding Xn, and definitions [1.65] and [1.67] of the Frobenius norm (||.||F) and the nuclear norm (||.||∗) of a matrix; for a tensor, see section 3.16.
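The masked fitting criterion of the imputation rows of Table I.4 can be sketched as follows (assuming NumPy; all names are illustrative). The binary tensor plays the role of definition [I.2]: entries flagged as missing are excluded from the fit, so garbage values stored in the missing positions do not influence the criterion:

```python
import numpy as np

def masked_cpd_cost(X, W, A, B, C):
    """Imputation criterion ||W (Hadamard) (X - [[A, B, C]])||_F^2.

    W is a binary tensor in the sense of [I.2]: 1 where the entry of X is
    known, 0 where it is missing.
    """
    model = np.einsum('ir,jr,kr->ijk', A, B, C)
    return np.linalg.norm(W * (X - model)) ** 2

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((5, 2))
X = np.einsum('ir,jr,kr->ijk', A, B, C)

W = (rng.random(X.shape) < 0.7).astype(float)  # ~70% of entries observed
W[0, 0, 0] = 0.0                               # force at least one missing entry
X_corrupted = X.copy()
X_corrupted[W == 0] = 999.0                    # garbage in the missing entries
```

At the true factors, the masked criterion is zero despite the corrupted entries, whereas an all-ones mask would heavily penalize them.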
The criteria listed in Table I.4 can be globally minimized using a nonlinear optimization method such as a gradient descent algorithm (with fixed or optimal step size), or the Gauss–Newton and Levenberg–Marquardt algorithms, the latter being a regularized form of the former. In the case of constrained optimization, the augmented Lagrangian method is very often used, as it allows the constrained optimization problem to be transformed into a sequence of unconstrained optimization problems.
The drawbacks of these optimization methods include slow convergence for gradient-type algorithms and high numerical complexity for the Gauss–Newton and Levenberg–Marquardt algorithms due to the need to compute the Jacobian matrix of the criterion w.r.t. the parameters being estimated, as well as the inverse of a large matrix.
Alternating optimization methods are therefore often used instead of a global optimization w.r.t. all of the matrix and tensor factors to be estimated. These iterative methods perform a sequence of separate optimizations, each w.r.t. a single unknown factor in which the model is linear, the other factors being fixed at the values estimated at previous iterations. An example is the standard ALS (alternating least squares) algorithm, presented in Chapter 5 for estimating PARAFAC models. For constrained optimization, the alternating direction method of multipliers (ADMM) is often used (Boyd et al. 2011).
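A minimal ALS sketch for an unconstrained third-order CPD can be written as follows (assuming NumPy; this is an illustrative implementation, not the algorithm as presented in Chapter 5). Each sweep solves three linear least squares problems, one per mode-n unfolding, with the other two factors fixed:

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker (Khatri-Rao) product: (J x R), (K x R) -> (JK x R)."""
    J, R = U.shape
    K, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(J * K, R)

def als_cpd(X, R, n_iter=500, seed=0):
    """Plain ALS for a rank-R CPD of a third-order tensor X.

    Each factor update is the LS solution of the linear system obtained
    from the corresponding mode-n unfolding of X.
    """
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    X1 = X.reshape(I, J * K)                     # mode-1 unfolding
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)  # mode-2 unfolding
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(B, C), X1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), X2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), X3.T, rcond=None)[0].T
    return A, B, C

# Recover an exact rank-2 tensor starting from random factors.
rng = np.random.default_rng(3)
A0 = rng.standard_normal((4, 2))
B0 = rng.standard_normal((5, 2))
C0 = rng.standard_normal((6, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = als_cpd(X, R=2)
rel_err = (np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C))
           / np.linalg.norm(X))
```

Each inner step is a well-posed linear LS problem precisely because the CPD is linear in each factor taken separately, which is the property exploited by alternating methods.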
To complete this introductory chapter, let us outline the key knowledge needed to employ tensor tools, whose presentation constitutes the main objective of this second volume:
– arrangement (also called reshaping) operations that express the data tensor as a vector (vectorization), a matrix (matricization), or a lower order tensor by combining modes; conversely, the tensorization and Hankelization operations allow us to construct tensors from data contained in large vectors or matrices;
– tensor operations such as transposition, symmetrization, Hadamard and Kronecker products, inversion and pseudo-inversion;
– the notions of eigenvalue and singular value of a tensor;
– tensor decompositions/models, and their uniqueness properties;
– algorithms used to solve dimensionality reduction problems, and hence best low-rank approximation, parameter estimation and missing data imputation problems. This algorithmic aspect of tensor tools will be explored in more depth in Volume 3.
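The reshaping operations of the first item above can be sketched as follows (assuming NumPy; the unfolding convention chosen here, with mode n mapped to the rows and the remaining modes merged in their original order, is one of several conventions used in the literature):

```python
import numpy as np

# A small third-order tensor to illustrate the reshaping operations.
X = np.arange(24).reshape(2, 3, 4)

# Vectorization: all entries stacked into a single vector.
vec_X = X.reshape(-1)

# Mode-n matricization (unfolding): mode-n fibers of X become the rows'
# entries; the remaining modes are merged in their original order.
def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X1 = unfold(X, 0)   # 2 x 12
X2 = unfold(X, 1)   # 3 x 8
X3 = unfold(X, 2)   # 4 x 6

# Tensorization (refolding): inverse operation, rebuilding the tensor
# from its mode-n unfolding and the original shape.
def fold(M, mode, shape):
    full = (shape[mode],) + tuple(s for i, s in enumerate(shape) if i != mode)
    return np.moveaxis(M.reshape(full), 0, mode)
```

Refolding an unfolding recovers the original tensor exactly, which is what makes matricization a lossless rearrangement rather than an approximation.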