3.6.1 CoNN Architecture
A CoNN usually takes an order‐3 tensor as its input, for example, an image with H rows, W columns, and three channels (R, G, B color channels). Higher‐order tensor inputs, however, can be handled by a CoNN in a similar fashion. The input then goes through a series of processing steps. A processing step is usually called a layer, which could be a convolution layer, a pooling layer, a normalization layer, a fully connected layer, a loss layer, etc. We will introduce the details of these layers later.
For now, let us first give an abstract description of the CoNN structure. Layer‐by‐layer operation in a forward pass of a CoNN can be formally represented as x1 → w1 → x2 → ⋯ → xL−1 → wL−1 → xL → wL → z. This will be referred to as the operation chain. The input is x1, usually an image (an order‐3 tensor). It undergoes the processing of the first layer, whose parameters we denote collectively by a tensor w1, the first box in the chain. The output of the first layer is x2, which also acts as the input to the second‐layer processing. This processing continues until all layers in the CoNN have been processed, at which point xL is output.
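To make the operation chain concrete, here is a minimal sketch in Python (not taken from the book); the layer types, shapes, and variable names are assumptions chosen only to illustrate the flow from x1 to xL.

```python
import numpy as np

# Sketch of the chain x1 -> w1 -> x2 -> w2 -> x3 -> w3 -> xL for a tiny CoNN.
def linear_layer(x, w):
    # stand-in for a convolution or fully connected layer: a linear map on the flattened input
    return w @ x.reshape(-1)

def relu_layer(x, w=None):
    # an example of a layer whose parameter tensor wi is empty
    return np.maximum(x, 0.0)

x1 = np.random.rand(8, 8, 3)            # order-3 input: H = 8 rows, W = 8 columns, 3 channels
w1 = np.random.randn(16, 8 * 8 * 3)     # parameters of the first layer
w3 = np.random.randn(5, 16)             # parameters of the third layer (C = 5 classes)

x2 = linear_layer(x1, w1)               # output of layer 1, input to layer 2
x3 = relu_layer(x2)                     # layer 2 has no parameters
xL = linear_layer(x3, w3)               # final C-dimensional output xL
```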
An additional layer, however, is added for backward error propagation, a method that learns good parameter values in the CoNN. Let us suppose the problem at hand is an image classification problem with C classes. A commonly used strategy is to output xL as a C‐dimensional vector whose i‐th entry encodes the prediction (the posterior probability that x1 comes from the i‐th class). To make xL a probability mass function, we can set the processing in the (L − 1)‐th layer as a softmax transformation of xL−1. In other applications, the output xL may have other forms and interpretations.
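A minimal sketch of the softmax transformation that turns a score vector such as xL−1 into a probability mass function; the numeric values below are hypothetical.

```python
import numpy as np

def softmax(x):
    # subtracting the maximum does not change the result but avoids numerical overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

x_prev = np.array([2.0, 1.0, -0.5])   # hypothetical x^(L-1) for C = 3 classes
x_L = softmax(x_prev)                 # entries are non-negative and sum to 1
```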
The last layer is a loss layer. Let us suppose t is the corresponding target (ground truth) value for the input x1; then a cost or loss function can be used to measure the discrepancy between the CoNN prediction xL and the target t. For example, a simple loss function could be z = ‖t − xL‖²/2, although more complex loss functions are usually used. This squared ℓ2 loss can be used in a regression problem. In a classification problem, the cross‐entropy loss is often used. The ground truth in a classification problem is a categorical variable t. We first convert the categorical variable t into a C‐dimensional vector by one‐hot encoding: all entries are zero except for a 1 at the position of the true class. Now both this target vector and xL are probability mass functions, and the cross‐entropy loss measures the distance between them. Hence, we can minimize the cross‐entropy. The operation chain explicitly models the loss function as a loss layer, whose processing is modeled as a box with parameters wL. Note that some layers may not have any parameters; that is, wi may be empty for some i. The softmax layer is one such example.
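As a hedged illustration of the two losses mentioned above, the sketch below converts a categorical label to a C‐dimensional indicator vector and evaluates the cross‐entropy and squared ℓ2 losses; the helper names are assumptions, not the book's notation.

```python
import numpy as np

def one_hot(t, C):
    # convert a categorical label t in {0, ..., C-1} to a C-dimensional indicator vector
    v = np.zeros(C)
    v[t] = 1.0
    return v

def cross_entropy(target, x_L, eps=1e-12):
    # distance between the two probability mass functions target and x_L
    return -np.sum(target * np.log(x_L + eps))

def squared_l2(target, x_L):
    # the simple loss z = ||t - xL||^2 / 2 mentioned above
    return 0.5 * np.sum((target - x_L) ** 2)

x_L = np.array([0.7, 0.2, 0.1])       # hypothetical prediction that is already a pmf
z = cross_entropy(one_hot(0, 3), x_L)
```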
The forward run: If all the parameters of a CoNN model w1, … , wL−1 have been learned, then we are ready to use this model for prediction, which only involves running the CoNN model forward, that is, in the direction of the arrows in the operation chain. Starting from the input x1, we pass it through the processing of the first layer (the box with parameters w1) and get x2. In turn, x2 is passed into the second layer, and so on. Finally, we obtain xL ∈ ℝC, which estimates the posterior probabilities of x1 belonging to the C categories. We can output the CoNN prediction as arg maxi (xL)i, that is, the index of the largest entry of xL.
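A sketch of the forward run for prediction, assuming the layers are available as plain functions; the two‐layer chain below is hypothetical and only illustrates the composition followed by the arg max.

```python
import numpy as np

def forward_run(x1, layers):
    # layers: list of (function, parameters) pairs; parameters may be None for empty wi
    x = x1
    for f, w in layers:
        x = f(x, w)
    return x                                        # xL in R^C

# hypothetical two-layer chain: a linear map followed by a parameter-free softmax
layers = [
    (lambda x, w: w @ x.reshape(-1), np.random.randn(3, 48)),
    (lambda x, w: np.exp(x - x.max()) / np.exp(x - x.max()).sum(), None),
]
x1 = np.random.rand(4, 4, 3)                        # 4 * 4 * 3 = 48 input values
xL = forward_run(x1, layers)
prediction = int(np.argmax(xL))                     # arg max over the C = 3 posteriors
```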
SGD: As before in this chapter, the parameters of a CoNN model are optimized to minimize the loss z; that is, we want the predictions of a CoNN model to match the ground‐truth labels. Let us suppose one training example x1 is given for training such parameters. The training process involves running the CoNN network in both directions. We first run the network in the forward pass to get xL, a prediction made with the current CoNN parameters. Instead of outputting this prediction, however, we compare it with the target t corresponding to x1; that is, we continue running the forward pass through the last (loss) layer. Finally, we obtain a loss z. The loss z is then a supervision signal, guiding how the parameters of the model should be modified (updated). The SGD update of the parameters is wi ← wi − η ∂z/∂wi, where η is the learning rate. Here, the ← sign implicitly indicates that the parameters wi (of the i‐th layer) are updated from time t to t + 1. If a time index t is explicitly used, this equation reads (wi)t+1 = (wi)t − η ∂z/∂(wi)t.
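A minimal sketch of one SGD step, under the assumption that the gradients ∂z/∂wi have already been obtained by backpropagation; the tensor shapes below are arbitrary.

```python
import numpy as np

def sgd_step(params, grads, eta=0.01):
    # params[i] is wi and grads[i] is dz/dwi; layers with empty parameters are skipped
    return [w if w is None else w - eta * g for w, g in zip(params, grads)]

# hypothetical parameter and gradient tensors for a two-layer model (second layer parameter-free)
params = [np.random.randn(16, 192), None]
grads = [np.random.randn(16, 192), None]
params = sgd_step(params, grads, eta=0.01)          # wi <- wi - eta * dz/dwi
```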
Error backpropagation: As before, the last layer’s partial derivatives are easy to compute. Because xL is connected to z directly under the control of parameters wL, it is easy to compute ∂z/∂wL. This step is only needed when wL is not empty. Similarly, it is also easy to compute ∂z/∂xL. If the squared ℓ2 loss is used, we have an empty ∂z/∂wL and ∂z/∂xL = xL − t. For every layer i, we compute two sets of gradients: the partial derivatives of z with respect to the parameters wi and with respect to that layer’s input xi. The term ∂z/∂wi can be used to update the current (i‐th) layer’s parameters, while ∂z/∂xi can be used to propagate the update backward, for example, to the (i − 1)‐th layer. An intuitive explanation is that xi is the output of the (i − 1)‐th layer, and ∂z/∂xi is how xi should be changed to reduce the loss function. Hence, we could view ∂z/∂xi as the part of the “error” supervision information propagated from z backward until the current layer, in a layer‐by‐layer fashion. Thus, we can continue the backpropagation process and use ∂z/∂xi to propagate the errors backward to the (i − 1)‐th layer. This layer‐by‐layer backward updating procedure makes learning a CoNN much easier. When we are updating the i‐th layer, the backpropagation process for the (i + 1)‐th layer must have been completed; that is, we must already have computed the terms ∂z/∂wi+1 and ∂z/∂xi+1. Both are stored in memory and ready for use. Now our task is to compute ∂z/∂wi and ∂z/∂xi. Using the chain rule, we have

∂z/∂(vec(wi)T) = [∂z/∂(vec(xi+1)T)] [∂vec(xi+1)/∂(vec(wi)T)], ∂z/∂(vec(xi)T) = [∂z/∂(vec(xi+1)T)] [∂vec(xi+1)/∂(vec(xi)T)]. (3.79)
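As a concrete (and hedged) instance of these two gradient sets, the sketch below computes ∂z/∂wi and ∂z/∂xi for a single fully connected layer xi+1 = W xi; the layer type is an assumption, but convolution and other layers follow the same pattern.

```python
import numpy as np

def fc_forward(x, W):
    # a single fully connected layer: x^(i+1) = W x^i
    return W @ x

def fc_backward(x, W, dz_dx_next):
    # dz_dx_next is dz/dx^(i+1), already computed by the (i+1)-th layer's backward step
    dz_dW = np.outer(dz_dx_next, x)    # dz/dwi: used to update this layer's parameters
    dz_dx = W.T @ dz_dx_next           # dz/dxi: the "error" propagated to the (i-1)-th layer
    return dz_dW, dz_dx

x_i = np.random.randn(4)
W_i = np.random.randn(3, 4)
dz_dx_next = np.random.randn(3)        # supervision signal arriving from layer i + 1
dz_dW, dz_dx = fc_backward(x_i, W_i, dz_dx_next)
```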
Since ∂z/∂xi+1 is already computed and stored in memory, it requires just a matrix reshaping operation (vec) and an additional transpose operation to get ∂z/∂(vec(xi+1)T). As long as we can compute ∂vec(xi+1)/∂(vec(wi)T) and ∂vec(xi+1)/∂(vec(xi)T), we can easily get Eq. (3.79). These two terms are much easier to compute than computing ∂z/∂(vec(wi)T) and ∂z/∂(vec(xi)T) directly, because xi+1 is directly related to xi through a function with parameters wi. The details of these partial derivatives will be discussed in the following sections.
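For intuition only, the decomposition in Eq. (3.79) can be checked numerically on the same tiny fully connected layer by materializing the two Jacobians explicitly (something the following sections avoid doing in practice); the row‐major vec convention here is an assumption of this sketch.

```python
import numpy as np

# Tiny numerical check of the Eq. (3.79) decomposition for x^(i+1) = W x^i,
# using explicit Jacobians and a row-major vec; this is an illustration only.
x_i = np.random.randn(4)
W_i = np.random.randn(3, 4)
dz_dx_next = np.random.randn(3)                 # dz/d(vec x^(i+1))^T as a row vector

J_x = W_i                                       # d vec(x^(i+1)) / d vec(x^i)
J_w = np.kron(np.eye(3), x_i)                   # d vec(x^(i+1)) / d vec(W), shape (3, 12)

dz_dx = dz_dx_next @ J_x                        # equals W^T dz_dx_next
dz_dW = (dz_dx_next @ J_w).reshape(3, 4)        # equals outer(dz_dx_next, x_i)

assert np.allclose(dz_dx, W_i.T @ dz_dx_next)
assert np.allclose(dz_dW, np.outer(dz_dx_next, x_i))
```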