3.6.1 CoNN Architecture
A CoNN usually takes an order‐3 tensor as its input, for example, an image with H rows, W columns, and three channels (R, G, B color channels). Higher‐order tensor inputs, however, can be handled by a CoNN in a similar fashion. The input then goes through a series of processing steps. A processing step is usually called a layer, which could be a convolution layer, a pooling layer, a normalization layer, a fully connected layer, a loss layer, etc. We will introduce the details of these layers later.
For now, let us first give an abstract description of the CoNN structure. Layer‐by‐layer operation in a forward pass of a CoNN can be formally represented as x1 → w1 → x2 → ⋯ → xL−1 → wL−1 → xL → wL → z. This will be referred to as the operation chain. The input is x1, usually an image (an order‐3 tensor). It undergoes the processing of the first layer, whose parameters we denote collectively by a tensor w1, the first box in the chain. The output of the first layer is x2, which also acts as the input to the second‐layer processing. This processing continues until all layers in the CoNN have been processed, at which point xL is output.
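To make the operation chain concrete, here is a minimal sketch in Python (not taken from the book); the layer types, shapes, and variable names are assumptions chosen only to illustrate the flow from x1 to xL.

```python
import numpy as np

# Sketch of the chain x1 -> w1 -> x2 -> w2 -> x3 -> w3 -> xL for a tiny CoNN.
def linear_layer(x, w):
    # stand-in for a convolution or fully connected layer: a linear map on the flattened input
    return w @ x.reshape(-1)

def relu_layer(x, w=None):
    # an example of a layer whose parameter tensor wi is empty
    return np.maximum(x, 0.0)

x1 = np.random.rand(8, 8, 3)            # order-3 input: H = 8 rows, W = 8 columns, 3 channels
w1 = np.random.randn(16, 8 * 8 * 3)     # parameters of the first layer
w3 = np.random.randn(5, 16)             # parameters of the third layer (C = 5 classes)

x2 = linear_layer(x1, w1)               # output of layer 1, input to layer 2
x3 = relu_layer(x2)                     # layer 2 has no parameters
xL = linear_layer(x3, w3)               # final C-dimensional output xL
```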
An additional layer, however, is added for backward error propagation, a method that learns good parameter values in the CoNN. Let us suppose the problem at hand is an image classification problem with C classes. A commonly used strategy is to output xL as a C‐dimensional vector whose i‐th entry encodes the prediction (the posterior probability that x1 comes from the i‐th class). To make xL a probability mass function, we can set the processing in the (L − 1)‐th layer as a softmax transformation of xL−1. In other applications, the output xL may have other forms and interpretations.
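A minimal sketch of the softmax transformation that turns a score vector such as xL−1 into a probability mass function; the numeric values below are hypothetical.

```python
import numpy as np

def softmax(x):
    # subtracting the maximum does not change the result but avoids numerical overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

x_prev = np.array([2.0, 1.0, -0.5])   # hypothetical x^(L-1) for C = 3 classes
x_L = softmax(x_prev)                 # entries are non-negative and sum to 1
```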
The last layer is a loss layer. Let us suppose t is the corresponding target (ground truth) value for the input x1; then a cost or loss function can be used to measure the discrepancy between the CoNN prediction xL and the target t. For example, a simple loss function could be z = ‖t − xL‖²/2, although more complex loss functions are usually used. This squared ℓ2 loss can be used in a regression problem. In a classification problem, the cross‐entropy loss is often used. The ground truth in a classification problem is a categorical variable t. We first convert the categorical variable t into a C‐dimensional vector by one‐hot encoding: all entries are zero except for a 1 at the position of the true class. Now both this target vector and xL are probability mass functions, and the cross‐entropy loss measures the distance between them. Hence, we can minimize the cross‐entropy. The operation chain explicitly models the loss function as a loss layer, whose processing is modeled as a box with parameters wL. Note that some layers may not have any parameters; that is, wi may be empty for some i. The softmax layer is one such example.
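As a hedged illustration of the two losses mentioned above, the sketch below converts a categorical label to a C‐dimensional indicator vector and evaluates the cross‐entropy and squared ℓ2 losses; the helper names are assumptions, not the book's notation.

```python
import numpy as np

def one_hot(t, C):
    # convert a categorical label t in {0, ..., C-1} to a C-dimensional indicator vector
    v = np.zeros(C)
    v[t] = 1.0
    return v

def cross_entropy(target, x_L, eps=1e-12):
    # distance between the two probability mass functions target and x_L
    return -np.sum(target * np.log(x_L + eps))

def squared_l2(target, x_L):
    # the simple loss z = ||t - xL||^2 / 2 mentioned above
    return 0.5 * np.sum((target - x_L) ** 2)

x_L = np.array([0.7, 0.2, 0.1])       # hypothetical prediction that is already a pmf
z = cross_entropy(one_hot(0, 3), x_L)
```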
The forward run: If all the parameters of a CoNN model w1, … , wL−1 have been learned, then we are ready to use this model for prediction, which only involves running the CoNN model forward, that is, in the direction of the arrows in the operation chain. Starting from the input x1, we pass it through the processing of the first layer (the box with parameters w1) and get x2. In turn, x2 is passed into the second layer, and so on. Finally, we obtain xL ∈ ℝC, which estimates the posterior probabilities of x1 belonging to the C categories. We can output the CoNN prediction as arg maxi (xL)i, that is, the index of the largest entry of xL.
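A sketch of the forward run for prediction, assuming the layers are available as plain functions; the two‐layer chain below is hypothetical and only illustrates the composition followed by the arg max.

```python
import numpy as np

def forward_run(x1, layers):
    # layers: list of (function, parameters) pairs; parameters may be None for empty wi
    x = x1
    for f, w in layers:
        x = f(x, w)
    return x                                        # xL in R^C

# hypothetical two-layer chain: a linear map followed by a parameter-free softmax
layers = [
    (lambda x, w: w @ x.reshape(-1), np.random.randn(3, 48)),
    (lambda x, w: np.exp(x - x.max()) / np.exp(x - x.max()).sum(), None),
]
x1 = np.random.rand(4, 4, 3)                        # 4 * 4 * 3 = 48 input values
xL = forward_run(x1, layers)
prediction = int(np.argmax(xL))                     # arg max over the C = 3 posteriors
```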
SGD: As before in this chapter, the parameters of a CoNN model are optimized to minimize the loss z; that is, we want the predictions of a CoNN model to match the ground‐truth labels. Let us suppose one training example x1 is given for training such parameters. The training process involves running the CoNN network in both directions. We first run the network in the forward pass to get xL, a prediction made with the current CoNN parameters. Instead of outputting this prediction, however, we compare it with the target t corresponding to x1; that is, we continue running the forward pass through the last (loss) layer. Finally, we obtain a loss z. The loss z is then a supervision signal, guiding how the parameters of the model should be modified (updated). The SGD update of the parameters is wi ← wi − η ∂z/∂wi, where η is the learning rate. Here, the ← sign implicitly indicates that the parameters wi (of the i‐th layer) are updated from time t to t + 1. If a time index t is explicitly used, this equation reads (wi)t+1 = (wi)t − η ∂z/∂(wi)t.
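A minimal sketch of one SGD step, under the assumption that the gradients ∂z/∂wi have already been obtained by backpropagation; the tensor shapes below are arbitrary.

```python
import numpy as np

def sgd_step(params, grads, eta=0.01):
    # params[i] is wi and grads[i] is dz/dwi; layers with empty parameters are skipped
    return [w if w is None else w - eta * g for w, g in zip(params, grads)]

# hypothetical parameter and gradient tensors for a two-layer model (second layer parameter-free)
params = [np.random.randn(16, 192), None]
grads = [np.random.randn(16, 192), None]
params = sgd_step(params, grads, eta=0.01)          # wi <- wi - eta * dz/dwi
```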
Error backpropagation: As before, the last layer’s partial derivatives are easy to compute. Because xL is connected to z directly under the control of parameters wL, it is easy to compute ∂z/∂wL. This step is only needed when wL is not empty. Similarly, it is also easy to compute ∂z/∂xL. If the squared ℓ2 loss is used, we have an empty ∂z/∂wL and ∂z/∂xL = xL − t. For every layer i, we compute two sets of gradients: the partial derivatives of z with respect to the parameters wi and with respect to that layer’s input xi. The term ∂z/∂wi can be used to update the current (i‐th) layer’s parameters, while ∂z/∂xi can be used to propagate the update backward, for example, to the (i − 1)‐th layer. An intuitive explanation is that xi is the output of the (i − 1)‐th layer, and ∂z/∂xi is how xi should be changed to reduce the loss function. Hence, we could view ∂z/∂xi as the part of the “error” supervision information propagated from z backward until the current layer, in a layer‐by‐layer fashion. Thus, we can continue the backpropagation process and use ∂z/∂xi to propagate the errors backward to the (i − 1)‐th layer. This layer‐by‐layer backward updating procedure makes learning a CoNN much easier. When we are updating the i‐th layer, the backpropagation process for the (i + 1)‐th layer must have been completed; that is, we must already have computed the terms ∂z/∂wi+1 and ∂z/∂xi+1. Both are stored in memory and ready for use. Now our task is to compute ∂z/∂wi and ∂z/∂xi. Using the chain rule, we have

∂z/∂(vec(wi)T) = [∂z/∂(vec(xi+1)T)] [∂vec(xi+1)/∂(vec(wi)T)], ∂z/∂(vec(xi)T) = [∂z/∂(vec(xi+1)T)] [∂vec(xi+1)/∂(vec(xi)T)]. (3.79)
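As a concrete (and hedged) instance of these two gradient sets, the sketch below computes ∂z/∂wi and ∂z/∂xi for a single fully connected layer xi+1 = W xi; the layer type is an assumption, but convolution and other layers follow the same pattern.

```python
import numpy as np

def fc_forward(x, W):
    # a single fully connected layer: x^(i+1) = W x^i
    return W @ x

def fc_backward(x, W, dz_dx_next):
    # dz_dx_next is dz/dx^(i+1), already computed by the (i+1)-th layer's backward step
    dz_dW = np.outer(dz_dx_next, x)    # dz/dwi: used to update this layer's parameters
    dz_dx = W.T @ dz_dx_next           # dz/dxi: the "error" propagated to the (i-1)-th layer
    return dz_dW, dz_dx

x_i = np.random.randn(4)
W_i = np.random.randn(3, 4)
dz_dx_next = np.random.randn(3)        # supervision signal arriving from layer i + 1
dz_dW, dz_dx = fc_backward(x_i, W_i, dz_dx_next)
```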
Since ∂z/∂xi+1 is already computed and stored in memory, it requires just a matrix reshaping operation (vec) and an additional transpose operation to get ∂z/∂(vec(xi+1)T). As long as we can compute ∂vec(xi+1)/∂(vec(wi)T) and ∂vec(xi+1)/∂(vec(xi)T), we can easily get Eq. (3.79). These two terms are much easier to compute than computing ∂z/∂(vec(wi)T) and ∂z/∂(vec(xi)T) directly, because xi+1 is directly related to xi through a function with parameters wi. The details of these partial derivatives will be discussed in the following sections.
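For intuition only, the decomposition in Eq. (3.79) can be checked numerically on the same tiny fully connected layer by materializing the two Jacobians explicitly (something the following sections avoid doing in practice); the row‐major vec convention here is an assumption of this sketch.

```python
import numpy as np

# Tiny numerical check of the Eq. (3.79) decomposition for x^(i+1) = W x^i,
# using explicit Jacobians and a row-major vec; this is an illustration only.
x_i = np.random.randn(4)
W_i = np.random.randn(3, 4)
dz_dx_next = np.random.randn(3)                 # dz/d(vec x^(i+1))^T as a row vector

J_x = W_i                                       # d vec(x^(i+1)) / d vec(x^i)
J_w = np.kron(np.eye(3), x_i)                   # d vec(x^(i+1)) / d vec(W), shape (3, 12)

dz_dx = dz_dx_next @ J_x                        # equals W^T dz_dx_next
dz_dW = (dz_dx_next @ J_w).reshape(3, 4)        # equals outer(dz_dx_next, x_i)

assert np.allclose(dz_dx, W_i.T @ dz_dx_next)
assert np.allclose(dz_dW, np.outer(dz_dx_next, x_i))
```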