
3.6.2 Layers in CoNN


Suppose we are considering the l‐th layer, whose inputs form an order‐3 tensor xl of size Hl × Wl × Dl. A triplet index set (il, jl, dl) is used to locate any specific element in xl: the triplet (il, jl, dl) refers to the element of xl in the dl‐th channel at spatial location (il, jl) (the il‐th row and jl‐th column). In actual CoNN learning, the mini‐batch strategy is usually used; in that case, xl becomes an order‐4 tensor of size Hl × Wl × Dl × N, where N is the mini‐batch size. For simplicity we assume for the moment that N = 1. The results in this section, however, are easy to adapt to mini‐batch versions. In order to simplify the notation that will appear later, we follow the zero‐based indexing convention, which specifies that 0 ≤ il < Hl, 0 ≤ jl < Wl, and 0 ≤ dl < Dl. In the l‐th layer, a function will transform the input xl to an output y = xl + 1. We assume the output has size Hl + 1 × Wl + 1 × Dl + 1, and an element in the output is indexed by a triplet (il + 1, jl + 1, dl + 1), 0 ≤ il + 1 < Hl + 1, 0 ≤ jl + 1 < Wl + 1, 0 ≤ dl + 1 < Dl + 1.

The Rectified Linear Unit (ReLU) layer: A ReLU layer does not change the size of the input; that is, xl and y share the same size. The ReLU can be regarded as a truncation performed individually on every element of the input: y_{i, j, d} = max{0, x^l_{i, j, d}}, with 0 ≤ i < Hl = Hl + 1, 0 ≤ j < Wl = Wl + 1, and 0 ≤ d < Dl = Dl + 1. There is no parameter inside a ReLU layer, and hence there is no need for parameter learning in this layer.
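As a minimal illustration, the following NumPy sketch implements this element‐wise truncation and the corresponding backward rule; the function names and shapes are illustrative, not taken from the book.

```python
import numpy as np

def relu_forward(x):
    # Element-wise truncation y_{i,j,d} = max(0, x^l_{i,j,d});
    # the output keeps the input's H^l x W^l x D^l size and there are no parameters.
    return np.maximum(x, 0.0)

def relu_backward(x, dz_dy):
    # The (sub)gradient passes through only where the input was positive.
    return dz_dy * (x > 0)

xl = np.random.randn(4, 5, 3)              # an H^l x W^l x D^l input
y = relu_forward(xl)                       # same size as xl
dz_dx = relu_backward(xl, np.ones_like(y)) # supervision signal routed backward
```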

The convolution layer: Figure 3.23 illustrates a convolution of the input image (a 3 × 4 matrix) with a convolution kernel of size 2 × 2. For order‐3 tensors, the convolution operation is defined similarly. Figure 3.24 illustrates an RGB (black/light gray/gray) image with three channels and three kernels. Suppose the input in the l‐th layer is an order‐3 tensor of size Hl × Wl × Dl. A convolution kernel is also an order‐3 tensor, of size H × W × Dl. When we overlap the kernel on top of the input tensor at spatial location (0, 0), we compute the products of the corresponding elements in all the Dl channels and sum the HWDl products to get the convolution result at this spatial location. Then, we move the kernel from top to bottom and from left to right to complete the convolution. In a convolution layer, multiple convolution kernels are usually used. Assuming D kernels are used and each kernel has spatial span H × W, we denote all the kernels together as f, which is an order‐4 tensor of size H × W × Dl × D. Similarly, we use index variables 0 ≤ i < H, 0 ≤ j < W, 0 ≤ dl < Dl, and 0 ≤ d < D to pinpoint a specific element in the kernels.

Stride is another important concept in convolution. At the bottom of Figure 3.23, we convolve the kernel with the input at every possible spatial location, which corresponds to the stride s = 1. However, if s > 1, every movement of the kernel skips s − 1 pixel locations (i.e., the convolution is performed once every s pixels both horizontally and vertically). In this section, we consider the simple case when the stride is 1 and no padding is used. Hence, y (or xl + 1) is a tensor of size Hl + 1 × Wl + 1 × Dl + 1, with Hl + 1 = Hl − H + 1, Wl + 1 = Wl − W + 1, and Dl + 1 = D. For mathematical rigor, the convolution procedure can be expressed as an equation:


Figure 3.23 Illustration of the convolution operation. If we overlap the convolution kernel on top of the input image, we can compute the product between the numbers at the same location in the kernel and the input, and we get a single number by summing these products together. For example, if we overlap the kernel with the top‐left region in the input, the convolution result at that spatial location is 1 × 1 + 1 × 4 + 1 × 2 + 1 × 5 = 12. (for more details see the color figure in the bins).


Figure 3.24 RGB image/three channels and three kernels. (for more details see the color figure in the bins).

y_{i^{l+1},\, j^{l+1},\, d} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} \sum_{d^l=0}^{D^l-1} f_{i,\, j,\, d^l,\, d} \; x^l_{i^{l+1}+i,\; j^{l+1}+j,\; d^l} \qquad (3.80)
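The loop‐based NumPy sketch below evaluates Eq. (3.80) directly for stride 1 and no padding; it is meant only to make the index bookkeeping concrete (the names are illustrative, the small input mirrors the example of Figure 3.23, and an optimized library convolution would be used in practice).

```python
import numpy as np

def conv_forward(xl, f):
    """Direct evaluation of Eq. (3.80) with stride 1 and no padding.
    xl: input of shape (H^l, W^l, D^l); f: kernels of shape (H, W, D^l, D).
    Returns y of shape (H^l - H + 1, W^l - W + 1, D)."""
    Hl, Wl, Dl = xl.shape
    H, W, Dl_f, D = f.shape
    assert Dl == Dl_f, "kernel depth must equal the number of input channels"
    Hout, Wout = Hl - H + 1, Wl - W + 1
    y = np.zeros((Hout, Wout, D))
    for i_out in range(Hout):
        for j_out in range(Wout):
            # overlap the kernel at (i_out, j_out) and sum the H*W*D^l products
            patch = xl[i_out:i_out + H, j_out:j_out + W, :]
            for d in range(D):
                y[i_out, j_out, d] = np.sum(patch * f[:, :, :, d])
    return y

# 3 x 4 single-channel input and 2 x 2 all-ones kernel as in Figure 3.23;
# the top-left output is 1*1 + 1*4 + 1*2 + 1*5 = 12
A = np.array([[1., 2., 3., 1.], [4., 5., 6., 1.], [7., 8., 9., 1.]])
y = conv_forward(A[:, :, None], np.ones((2, 2, 1, 1)))
```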

Convolution as matrix product: There is a way to expand xl and simplify the convolution as a matrix product. Let us consider a special case with Dl = D = 1, H = W = 2, and Hl = 3, Wl = 4. That is, we consider convolving a small single‐channel 3 × 4 matrix (or image) with one 2 × 2 filter. Using the example in Figure 3.23, we have

\begin{pmatrix} 1 & 2 & 3 & 1 \\ 4 & 5 & 6 & 1 \\ 7 & 8 & 9 & 1 \end{pmatrix} * \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 12 & 16 & 11 \\ 24 & 28 & 17 \end{pmatrix} \qquad (3.81)

where the first matrix is denoted as A, and * is the convolution operator.

The MATLAB command B = im2col(A, [2 2]) gives the B matrix, which is an expanded version of A:

B = \begin{pmatrix} 1 & 4 & 2 & 5 & 3 & 6 \\ 4 & 7 & 5 & 8 & 6 & 9 \\ 2 & 5 & 3 & 6 & 1 & 1 \\ 5 & 8 & 6 & 9 & 1 & 1 \end{pmatrix} \qquad (3.82)

Note that the first column of B corresponds to the first 2 × 2 region in A, in a column‐first order, corresponding to (il + 1, jl + 1) = (0, 0). Similarly, the second to the last columns in B correspond to regions in A with (il + 1, jl + 1) being (1, 0), (0, 1), (1, 1), (0, 2), and (1, 2), respectively. That is, the MATLAB im2col function explicitly expands the required elements for performing each individual convolution into a column of the matrix B. The transpose, BT, is called the im2row expansion of A. If we vectorize the convolution kernel itself into a vector (in the same column‐first order) (1, 1, 1, 1)T, we find that

B^T (1, 1, 1, 1)^T = (12, 24, 16, 28, 11, 17)^T \qquad (3.83)

If we reshape this resulting vector properly, we get the exact convolution result matrix in Eq. (3.81).
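For readers without MATLAB, a small NumPy stand‐in for the im2col call above (sliding 2 × 2 blocks only, enough for this example) reproduces B, BT, and the product in Eq. (3.83); the helper name is illustrative.

```python
import numpy as np

def im2col(A, block):
    """Sliding-block expansion: each column is one h x w block of A in column-first
    order, with the blocks themselves enumerated column by column (mimicking
    MATLAB's im2col(A, [h w], 'sliding') for this example)."""
    h, w = block
    Hl, Wl = A.shape
    cols = []
    for j in range(Wl - w + 1):            # blocks enumerated column by column
        for i in range(Hl - h + 1):
            cols.append(A[i:i + h, j:j + w].flatten(order='F'))
    return np.stack(cols, axis=1)

A = np.array([[1., 2., 3., 1.],
              [4., 5., 6., 1.],
              [7., 8., 9., 1.]])
B = im2col(A, (2, 2))                      # Eq. (3.82); B.T is the im2row expansion of A
kernel_vec = np.ones(4)                    # the 2 x 2 all-ones kernel, vectorized column-first
v = B.T @ kernel_vec                       # Eq. (3.83)
conv_result = v.reshape(2, 3, order='F')   # reshaped back to the 2 x 3 result of Eq. (3.81)
```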

If Dl > 1 (i.e., xl has more than one channel, as in the RGB image with three channels of Figure 3.24), the expansion operator could first expand the first channel of xl, then the second, … , until all Dl channels are expanded. The expanded channels are stacked together; that is, one row in the im2row expansion will have H × W × Dl elements, rather than H × W.

Suppose xl is a third‐order tensor of size Hl × Wl × Dl, with one element in xl represented by (il, jl, dl), and f is a set of convolution kernels whose spatial extents are all H × W. Then, the expansion operator (im2row) converts xl into a matrix φ(xl) of size (Hl + 1 Wl + 1) × (HW Dl), with elements indexed by a pair (p, q). The expansion operator copies the element at (il, jl, dl) in xl to the (p, q)‐th entry in φ(xl). From the description of the expansion process, given a fixed (p, q), we can calculate its corresponding (il, jl, dl) triplet from the relation

p = i^{l+1} + H^{l+1} \times j^{l+1}, \quad q = i + H \times j + HW \times d^l, \quad i^l = i^{l+1} + i, \quad j^l = j^{l+1} + j \qquad (3.84)

As an example, dividing q by HW and taking the integer part of the quotient, we can determine the channel (dl) to which the element belongs.
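The index bookkeeping of Eq. (3.84) can be sketched as a small helper, assuming the zero‐based, column‐first conventions used throughout this section (the function name is illustrative).

```python
def pq_to_triplet(p, q, H, W, Hout):
    """Map a location (p, q) of phi(x^l) back to the (i^l, j^l, d^l) it was copied from,
    per Eq. (3.84): p = i^{l+1} + Hout*j^{l+1}, q = i + H*j + H*W*d^l."""
    dl, r = divmod(q, H * W)        # integer part of q / (HW) gives the channel d^l
    j, i = divmod(r, H)             # offsets (i, j) inside the kernel, column-first
    jout, iout = divmod(p, Hout)    # output location (i^{l+1}, j^{l+1}), column-first
    return iout + i, jout + j, dl   # i^l = i^{l+1} + i,  j^l = j^{l+1} + j

# example: 2 x 2 kernel, H^{l+1} = 2; (p, q) = (3, 2) maps to (i^l, j^l, d^l) = (1, 2, 0)
print(pq_to_triplet(p=3, q=2, H=2, W=2, Hout=2))
```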

We can use the standard vec operator to convert the set of convolution kernels f (an order‐4 tensor) into a matrix. Each kernel can be vectorized into a vector of length HW Dl; stacking these D vectors as columns, all the convolution kernels are reshaped into a matrix with HW Dl rows and D columns (recall that Dl + 1 = D), which we refer to as the matrix F. With this notation, we have an expression to calculate the convolution results (in Eq. (3.83), φ(xl) is BT):

\mathrm{vec}(y) = \mathrm{vec}\!\left(\varphi(x^l)\, F\right) \qquad (3.85)

with vec(y) a vector of length Hl + 1 Wl + 1 D, φ(xl) of size (Hl + 1 Wl + 1) × (HW Dl), and F of size (HW Dl) × D. The matrix multiplication φ(xl)F results in a matrix of size (Hl + 1 Wl + 1) × D. The vectorization of this resultant matrix generates a vector of length Hl + 1 Wl + 1 D, which matches the dimensionality of vec(y).
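The sketch below builds the multi‐channel im2row expansion and the kernel matrix F under the same column‐first conventions, and spot‐checks Eq. (3.85) against the direct definition of Eq. (3.80); shapes and names are illustrative.

```python
import numpy as np

def im2row(xl, H, W):
    """phi(x^l): one row per output location (column-first order over (i^{l+1}, j^{l+1})),
    each row holding H*W*D^l entries, channel after channel, column-first within a channel."""
    Hl, Wl, Dl = xl.shape
    Hout, Wout = Hl - H + 1, Wl - W + 1
    rows = []
    for jout in range(Wout):
        for iout in range(Hout):
            patch = xl[iout:iout + H, jout:jout + W, :]
            rows.append(np.concatenate([patch[:, :, c].flatten(order='F') for c in range(Dl)]))
    return np.stack(rows, axis=0)          # (H^{l+1} W^{l+1}) x (H W D^l)

Hl, Wl, Dl, H, W, D = 5, 6, 3, 2, 2, 4
xl = np.random.randn(Hl, Wl, Dl)
f = np.random.randn(H, W, Dl, D)           # D kernels of size H x W x D^l

# reshape the kernels into the (H W D^l) x D matrix F, one vectorized kernel per column
F = np.stack([np.concatenate([f[:, :, c, d].flatten(order='F') for c in range(Dl)])
              for d in range(D)], axis=1)

Y = im2row(xl, H, W) @ F                   # Eq. (3.85): size (H^{l+1} W^{l+1}) x D

# spot-check one output element against the direct sum of Eq. (3.80)
i_out, j_out, d = 1, 2, 3
direct = np.sum(xl[i_out:i_out + H, j_out:j_out + W, :] * f[:, :, :, d])
assert np.isclose(Y[i_out + (Hl - H + 1) * j_out, d], direct)
```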

The Kronecker product: Given two matrices A of size m × n and B of size p × q, the Kronecker product A ⊗ B is an mp × nq matrix, defined as the block matrix

A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix} \qquad (3.86)

The Kronecker product has the following properties that will be useful for us:

(A \otimes B)^T = A^T \otimes B^T \qquad (3.87)

\mathrm{vec}(AXB) = \left(B^T \otimes A\right) \mathrm{vec}(X) \qquad (3.88)
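These two identities are easy to check numerically; the snippet below does so with NumPy's kron for random matrices (vec(·) denotes column‐first vectorization).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
vec = lambda M: M.flatten(order='F')       # column-first vectorization

# Eq. (3.87): (A kron B)^T = A^T kron B^T
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
# Eq. (3.88): vec(A X B) = (B^T kron A) vec(X)
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))
```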

The last equation can be utilized from both directions. Now we can write down

\mathrm{vec}(y) = \mathrm{vec}\!\left(\varphi(x^l)\, F\, I\right) = \left(I \otimes \varphi(x^l)\right) \mathrm{vec}(F) \qquad (3.89)

\mathrm{vec}(y) = \mathrm{vec}\!\left(I\, \varphi(x^l)\, F\right) = \left(F^T \otimes I\right) \mathrm{vec}\!\left(\varphi(x^l)\right) \qquad (3.90)

where I is an identity matrix of appropriate size. In Eq. (3.89), the size of I is determined by the number of columns in F, and hence I is D × D. In Eq. (3.90), the size of I is determined by the number of rows in φ(xl), and hence I is (Hl + 1 Wl + 1) × (Hl + 1 Wl + 1). For the derivation of the gradient computation rules in a convolution layer, the notation summarized in Table 3.3 will be used.

Update the parameters – backward propagation: First, we need to compute ∂z/∂vec(xl) and ∂z/∂vec(F), where the first term will be used for backward propagation to the previous ((l − 1)‐th) layer, and the second term will determine how the parameters of the current (l‐th) layer will be updated. Keep in mind that f, F, and wl refer to the same thing (modulo reshaping of the vector, matrix, or tensor). Similarly, we can reshape y into a matrix Y of size (Hl + 1 Wl + 1) × D; then y, Y, and xl + 1 refer to the same object (again, modulo reshaping).

Table 3.3 Variables used in the derivation of the gradient computation rules (φ denotes the im2row expansion).

Alias | Size and meaning
X, xl | Hl Wl × Dl, the input tensor
F, f, wl | HW Dl × D, the D kernels, each H × W with Dl channels
Y, y, xl + 1 | Hl + 1 Wl + 1 × Dl + 1, the output, Dl + 1 = D
φ(xl) | Hl + 1 Wl + 1 × HW Dl, the im2row expansion of xl
M | Hl + 1 Wl + 1 HW Dl × Hl Wl Dl, the indicator matrix for φ(xl)
∂z/∂Y | Hl + 1 Wl + 1 × Dl + 1, gradient with respect to y
∂z/∂F | HW Dl × D, gradient used to update the convolution kernels
∂z/∂X | Hl Wl × Dl, gradient with respect to xl, useful for backpropagation

From the chain rule, it is easy to compute ∂z/∂vec(F) as

\frac{\partial z}{\partial (\mathrm{vec}(F))^T} = \frac{\partial z}{\partial (\mathrm{vec}(y))^T} \, \frac{\partial \mathrm{vec}(y)}{\partial (\mathrm{vec}(F))^T} \qquad (3.91)

The first term on the right-hand side of Eq. (3.91) has already been computed in the (l + 1)‐th layer as ∂z/∂(vec(xl + 1))T. Based on Eq. (3.89), we have

\frac{\partial \mathrm{vec}(y)}{\partial (\mathrm{vec}(F))^T} = I \otimes \varphi(x^l) \qquad (3.92)

We have used the fact that ∂(Xa)/∂aT = X, or equivalently ∂(Xa)T/∂a = XT, so long as the matrix multiplications are well defined. This equation leads to

\frac{\partial z}{\partial (\mathrm{vec}(F))^T} = \frac{\partial z}{\partial (\mathrm{vec}(y))^T} \left(I \otimes \varphi(x^l)\right) = \left(\mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\right)\right)^{\!T} \left(I \otimes \varphi(x^l)\right) \qquad (3.93)

Taking the transpose, we get

\frac{\partial z}{\partial \mathrm{vec}(F)} = \left(I \otimes \varphi(x^l)^T\right) \mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\right) = \mathrm{vec}\!\left(\varphi(x^l)^T \, \frac{\partial z}{\partial Y}\, I\right) = \mathrm{vec}\!\left(\varphi(x^l)^T \, \frac{\partial z}{\partial Y}\right) \qquad (3.94)

Both Eqs. (3.87) and (3.88) are used in the above derivation giving ∂z/∂F= φ(xl)T ∂z/∂Y, which is a simple rule to update the parameters in the l−th layer: the gradient with respect to the convolution parameters is the product between φ(xl)T (the im2col expansion) and ∂z/∂Y (the supervision signal transferred from the (l + 1)‐th layer).
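In code, this update rule is a single matrix product; the sketch below assumes the im2row expansion φ(xl) and the supervision signal ∂z/∂Y are already available as arrays (names and the learning rate are illustrative).

```python
import numpy as np

def conv_param_gradient(phi_xl, dz_dY):
    """dz/dF = phi(x^l)^T dz/dY, per Eq. (3.94):
    phi_xl has size (H^{l+1} W^{l+1}) x (H W D^l), dz_dY has size (H^{l+1} W^{l+1}) x D,
    so the result has the (H W D^l) x D shape of F."""
    return phi_xl.T @ dz_dY

def sgd_update_kernels(F, phi_xl, dz_dY, eta=1e-2):
    # plain gradient-descent step on the kernel matrix F
    return F - eta * conv_param_gradient(phi_xl, dz_dY)
```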

The matrix φ(xl) has size (Hl + 1 Wl + 1) × (HW Dl). From the above, we know that its elements are indexed by a pair (p, q). So far, from Eq. (3.84) we know: (i) from q we can determine dl, the channel of the convolution kernel that is used, and we can also determine i and j, the spatial offsets inside the kernel; (ii) from p we can determine il + 1 and jl + 1, the spatial offsets inside the convolved result xl + 1; and (iii) the spatial offsets in the input xl can then be determined as il = il + 1 + i and jl = jl + 1 + j. In other words, the mapping m: (p, q) → (il, jl, dl) is one to one, and thus is a valid function. The inverse mapping, however, is one to many (and thus not a valid function). If we use m−1 to represent the inverse mapping, we know that m−1(il, jl, dl) is a set S, where each (p, q) ∈ S satisfies m(p, q) = (il, jl, dl). Now we take a look at φ(xl) from a different perspective.

The question: What information is required in order to fully specify this function? It is obvious that the following three types of information are needed (and only those). The answer: For every element of φ(xl), we need to know

(A) Which region does it belong to; that is, what is the value of p (0 ≤ p < Hl + 1 Wl + 1)?

(B) Which element is it inside the region (or equivalently inside the convolution kernel); that is, what is the value of q (0 ≤ q < HW Dl)? The above two types of information determine a location (p, q) inside φ(xl). The only missing information is (C) What is the value in that position, that is, [φ(xl)]pq?

Since every element in φ(xl) is a verbatim copy of one element from xl, we can reformulate question (C) into a different but equivalent one:

(C.1) Where is the value of a given [φ(xl)]pq copied from? Or, what is its original location inside xl, that is, an index u that satisfies 0 ≤ u < Hl Wl Dl? (C.2) The entire xl.

It is easy to see that the collective information in [A, B, C.1] (for the entire range of p, q, and u) and in (C.2) (xl itself) contains exactly the same amount of information as φ(xl). Since 0 ≤ p < Hl + 1 Wl + 1, 0 ≤ q < HW Dl, and 0 ≤ u < Hl Wl Dl, we can use a matrix M of size (Hl + 1 Wl + 1 HW Dl) × (Hl Wl Dl) to encode the information in [A, B, C.1]. One row index of this matrix corresponds to one location inside φ(xl) (a (p, q) pair). One row of M has Hl Wl Dl elements, and each element can be indexed by (il, jl, dl). Thus, each element in this matrix is indexed by a 5‐tuple: (p, q, il, jl, dl).

Then, we can use the “indicator” method to encode the function m(p, q) = (il, jl, dl) into M. That is, for any possible element in M, its row index x determines a (p, q) pair, its column index y determines an (il, jl, dl) triplet, and M is defined as

M(x, y) = \begin{cases} 1 & \text{if } m(p, q) = (i^l, j^l, d^l) \\ 0 & \text{otherwise} \end{cases} \qquad (3.95)

The M matrix is very high dimensional. At the same time, it is also very sparse: there is only one nonzero entry among the Hl Wl Dl elements in each row, because m is a function. M, which uses the information [A, B, C.1], encodes only the one‐to‐one correspondence between any element in φ(xl) and any element in xl; it does not encode any specific value in xl. Putting together the one‐to‐one correspondence information in M and the value information in xl, we have

\mathrm{vec}\!\left(\varphi(x^l)\right) = M \, \mathrm{vec}(x^l) \qquad (3.96)
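For a tiny single‐channel example, the indicator matrix M of Eq. (3.95) can be built explicitly and Eq. (3.96) verified directly; this is only for illustration, since M is never formed in practice.

```python
import numpy as np

Hl, Wl, Dl, H, W = 3, 4, 1, 2, 2
Hout, Wout = Hl - H + 1, Wl - W + 1
xl = np.random.randn(Hl, Wl, Dl)
vec = lambda T: T.flatten(order='F')       # column-first vectorization

P, Q = Hout * Wout, H * W * Dl
phi = np.zeros((P, Q))                     # the im2row expansion, filled below
M = np.zeros((P * Q, Hl * Wl * Dl))        # (H^{l+1} W^{l+1} H W D^l) x (H^l W^l D^l)
for p in range(P):
    jout, iout = divmod(p, Hout)           # p -> (i^{l+1}, j^{l+1}), Eq. (3.84)
    for q in range(Q):
        dl, r = divmod(q, H * W)           # q -> (i, j, d^l), Eq. (3.84)
        j, i = divmod(r, H)
        il, jl = iout + i, jout + j
        phi[p, q] = xl[il, jl, dl]
        row = p + P * q                    # row of M <-> the (p, q) location, p fastest
        col = il + Hl * jl + Hl * Wl * dl  # column of M <-> the (i^l, j^l, d^l) location
        M[row, col] = 1.0                  # the single nonzero entry of this row

assert np.allclose(vec(phi), M @ vec(xl))  # Eq. (3.96)
```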

Supervision signal for the previous layer: In the l‐th layer, we need to compute ∂z/∂vec(xl). For that purpose, we reshape xl into a matrix X of size (Hl Wl) × Dl, and use these two equivalent forms (modulo reshaping) interchangeably. By the chain rule, ∂z/∂(vec(xl))T = [∂z/∂(vec(y))T][∂vec(y)/∂(vec(xl))T].

By utilizing Eqs. (3.90) and (3.96), we have

\frac{\partial \mathrm{vec}(y)}{\partial (\mathrm{vec}(x^l))^T} = \frac{\partial \mathrm{vec}(y)}{\partial (\mathrm{vec}(\varphi(x^l)))^T}\, \frac{\partial \mathrm{vec}(\varphi(x^l))}{\partial (\mathrm{vec}(x^l))^T} = \left(F^T \otimes I\right) M \qquad (3.97)

\frac{\partial z}{\partial (\mathrm{vec}(x^l))^T} = \left(\mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\right)\right)^{\!T} \left(F^T \otimes I\right) M \qquad (3.98)

Since by using Eq. (3.88)

\left(\mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\right)\right)^{\!T} \left(F^T \otimes I\right) = \left(\mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\, F^T\right)\right)^{\!T} \qquad (3.99)

we have

\frac{\partial z}{\partial (\mathrm{vec}(x^l))^T} = \left(\mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\, F^T\right)\right)^{\!T} M \qquad (3.100)

or equivalently

\frac{\partial z}{\partial \mathrm{vec}(x^l)} = M^T\, \mathrm{vec}\!\left(\frac{\partial z}{\partial Y}\, F^T\right) \qquad (3.101)

In Eq. (3.101), ∂z/∂vec(xl) is a vector of length Hl Wl Dl, and vec((∂z/∂Y)FT) is a vector of length Hl + 1 Wl + 1 HW Dl. At the same time, MT is an indicator matrix of size (Hl Wl Dl) × (Hl + 1 Wl + 1 HW Dl). In order to locate one element in vec(xl) or one row in MT, we need an index triplet (il, jl, dl), with 0 ≤ il < Hl, 0 ≤ jl < Wl, and 0 ≤ dl < Dl. Similarly, to locate a column in MT or an element in (∂z/∂Y)FT, we need an index pair (p, q), with 0 ≤ p < Hl + 1 Wl + 1 and 0 ≤ q < HW Dl. Thus, the (il, jl, dl)‐th entry of ∂z/∂vec(xl) is the product of two vectors: the row in MT (or the column in M) that is indexed by (il, jl, dl), and vec((∂z/∂Y)FT). Since MT is an indicator matrix, in the row vector indexed by (il, jl, dl), only those entries whose index (p, q) satisfies m(p, q) = (il, jl, dl) have a value 1, and all other entries are 0. Thus, the (il, jl, dl)‐th entry of ∂z/∂vec(xl) equals the sum of the corresponding entries of vec((∂z/∂Y)FT). Therefore, we get the following succinct equation:

\left[\frac{\partial z}{\partial X}\right]_{(i^l,\, j^l,\, d^l)} = \sum_{(p,\, q)\, \in\, m^{-1}(i^l,\, j^l,\, d^l)} \left[\frac{\partial z}{\partial Y}\, F^T\right]_{(p,\, q)} \qquad (3.102)

In other words, to compute ∂z/∂X, we do not need to explicitly use the extremely high‐dimensional matrix M. Instead, Eqs. (3.102) and (3.84) can be used to efficiently find it. The convolution example from Figure 3.23 is used to illustrate the inverse mapping m−1 in Figure 3.25.

In the right half of Figure 3.25, the 6 × 4 matrix is (∂z/∂Y)FT. In order to compute the partial derivative of z with respect to one element in the input X, we need to find which elements in (∂z/∂Y)FT are involved and add them. In the left half of Figure 3.25, we see that the input element 5 (shown in larger font) is involved in four convolution operations, shown by the gray, light gray, dotted gray, and black boxes, respectively. These four convolution operations correspond to p = 1, 2, 3, 4. For example, when p = 2 (the light gray box), 5 is the third element in the convolution, and hence q = 3 when p = 2, and we put a light gray circle in the (2, 3)‐th element of the (∂z/∂Y)FT matrix. After all four circles are put in the matrix (∂z/∂Y)FT, the partial derivative is the sum of the elements in these four locations of (∂z/∂Y)FT. The set m−1(il, jl, dl) contains at most HWDl elements. Hence, Eq. (3.102) requires at most HWDl summations to compute one element of ∂z/∂X.
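The backward pass of Eq. (3.102) therefore amounts to a scatter‐add of (∂z/∂Y)FT back onto the input locations, a col2im‐style operation; the loop‐based sketch below assumes the same index conventions as Eq. (3.84) and avoids forming M explicitly.

```python
import numpy as np

def conv_input_gradient(dz_dY, F, Hl, Wl, Dl, H, W):
    """dz/dX per Eq. (3.102): scatter every entry of G = (dz/dY) F^T back to the input
    position it was copied from. dz_dY is (H^{l+1} W^{l+1}) x D, F is (H W D^l) x D,
    and the result has the input's H^l x W^l x D^l shape."""
    Hout, Wout = Hl - H + 1, Wl - W + 1
    G = dz_dY @ F.T                          # (H^{l+1} W^{l+1}) x (H W D^l)
    dz_dX = np.zeros((Hl, Wl, Dl))
    for p in range(Hout * Wout):
        jout, iout = divmod(p, Hout)         # p -> (i^{l+1}, j^{l+1})
        for q in range(H * W * Dl):
            dl, r = divmod(q, H * W)         # q -> (i, j, d^l)
            j, i = divmod(r, H)
            dz_dX[iout + i, jout + j, dl] += G[p, q]   # sum over m^{-1}(i^l, j^l, d^l)
    return dz_dX

# example shapes: 3 x 4 x 1 input, 2 x 2 kernels, D = 4 output channels
dz_dX = conv_input_gradient(np.ones((6, 4)), np.ones((4, 4)), Hl=3, Wl=4, Dl=1, H=2, W=2)
```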

The pooling layer: Let the input to the l‐th layer, which is now a pooling layer, be an order‐3 tensor xl of size Hl × Wl × Dl. The pooling operation requires no parameter (i.e., wl is null, and hence parameter learning is not needed for this layer). The spatial extent of the pooling (H × W) is specified in the design of the CoNN structure. Assuming that H divides Hl, W divides Wl, and the stride equals the pooling spatial extent, the output of pooling (y, or equivalently xl + 1) will be an order‐3 tensor of size Hl + 1 × Wl + 1 × Dl + 1, with Hl + 1 = Hl/H, Wl + 1 = Wl/W, and Dl + 1 = Dl. A pooling layer operates upon xl channel by channel independently. Within each channel, the matrix with Hl × Wl elements is divided into Hl + 1 × Wl + 1 nonoverlapping subregions, each subregion being H × W in size. The pooling operator then maps a subregion into a single number. Two types of pooling operators are widely used: max pooling and average pooling. In max pooling, the pooling operator maps a subregion to its maximum value, while average pooling maps a subregion to its average value, as illustrated in Figure 3.26.


Figure 3.25 Computing ∂z/∂X. (for more details see the color figure in the bins).


Figure 3.26 Illustration of pooling layer operation. (for more details see the color figure in the bins).

Formally this can be represented as

y_{i^{l+1},\, j^{l+1},\, d} = \max_{0 \le i < H,\; 0 \le j < W} x^l_{i^{l+1} \times H + i,\; j^{l+1} \times W + j,\; d} \quad \text{(max pooling)}, \qquad y_{i^{l+1},\, j^{l+1},\, d} = \frac{1}{HW} \sum_{0 \le i < H,\; 0 \le j < W} x^l_{i^{l+1} \times H + i,\; j^{l+1} \times W + j,\; d} \quad \text{(average pooling)} \qquad (3.103)

where 0 ≤ il + 1 < Hl + 1, 0 ≤ jl + 1 < Wl + 1, and 0 ≤ d < Dl + 1 = Dl.
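With the stride equal to the pooling size and H, W dividing the spatial dimensions, Eq. (3.103) reduces to a reshape followed by a reduction over each subregion, as in this NumPy sketch (names are illustrative).

```python
import numpy as np

def pool_forward(xl, H, W, mode="max"):
    """Non-overlapping pooling per Eq. (3.103): H divides H^l, W divides W^l, and the
    stride equals the pooling spatial extent; each channel is handled independently."""
    Hl, Wl, Dl = xl.shape
    assert Hl % H == 0 and Wl % W == 0
    blocks = xl.reshape(Hl // H, H, Wl // W, W, Dl)   # isolate each H x W subregion
    if mode == "max":
        return blocks.max(axis=(1, 3))                # max pooling
    return blocks.mean(axis=(1, 3))                   # average pooling

y = pool_forward(np.random.randn(4, 6, 3), 2, 2)      # output shape (2, 3, 3)
```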

Pooling is a local operator, and its forward computation is straightforward. When focusing on the backpropagation, only max pooling will be discussed and we can resort to the indicator matrix again. All we need to encode in this indicator matrix is: for every element in y, where does it come from in xl?

We need a triplet (il, jl, dl) to locate one element in the input xl, and another triplet (il + 1, jl + 1, dl + 1) to locate one element in y. The pooling output y at (il + 1, jl + 1, dl + 1) comes from the input element at (il, jl, dl) if and only if the following conditions are met: (i) they are in the same channel; (ii) the (il, jl)‐th spatial entry belongs to the (il + 1, jl + 1)‐th subregion; and (iii) the (il, jl)‐th spatial entry is the largest one in that subregion. This can be represented as

d^{l+1} = d^l, \quad \left\lfloor i^l / H \right\rfloor = i^{l+1}, \quad \left\lfloor j^l / W \right\rfloor = j^{l+1}, \quad x^l_{i^l,\, j^l,\, d^l} \ge x^l_{i,\, j,\, d^l} \;\; \text{for all } (i, j) \text{ in the } (i^{l+1}, j^{l+1})\text{-th subregion}

where ⌊·⌋ is the floor function. If the stride is not H (W) in the vertical (horizontal) direction, the equation must be changed accordingly. Given an (il + 1, jl + 1, dl + 1) triplet, there is only one (il, jl, dl) triplet that satisfies all these conditions. So, we define an indicator matrix S(xl) of size (Hl + 1 Wl + 1 Dl + 1) × (Hl Wl Dl).

One triplet of indexes (il + 1, jl + 1, dl + 1) specifies a row in S, while (il, jl, dl) specifies a column. These two triplets together pinpoint one element in S(xl). We set that element to 1 if dl + 1 = dl and the subregion and maximum conditions above are simultaneously satisfied, and 0 otherwise. One row of S(xl) corresponds to one element in y, and one column corresponds to one element in xl. By using this indicator matrix, we have vec(y) = S(xl) vec(xl) and

\frac{\partial \mathrm{vec}(y)}{\partial (\mathrm{vec}(x^l))^T} = S(x^l),

resulting in

\frac{\partial z}{\partial \mathrm{vec}(x^l)} = S(x^l)^T \, \frac{\partial z}{\partial \mathrm{vec}(y)} \qquad (3.104)

Figure 3.27 Illustration of preprocessing in a convolutional neural network (CoNN)‐based image classifier.

S(xl) is very sparse since it has only one nonzero entry in every row. Thus, we do not need to use the entire matrix in the computation. Instead, we just need to record the locations of those nonzero entries – there are only Hl + 1 Wl + 1 Dl + 1 such entries in S(xl). Figure 3.27 illustrates preprocessing in a CoNN‐based image classifier.
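Equivalently, the backward pass of max pooling can be written without ever forming S(xl), by recording only the arg‐max location of each subregion, as in the sketch below (a minimal illustration under the same non‐overlapping assumptions; names are illustrative).

```python
import numpy as np

def maxpool_backward(xl, dz_dy, H, W):
    """Route each entry of dz/dy to the position where the forward maximum was taken,
    i.e., apply S(x^l)^T of Eq. (3.104) using only its nonzero locations."""
    Hl, Wl, Dl = xl.shape
    dz_dx = np.zeros_like(xl)
    for d in range(Dl):
        for iout in range(Hl // H):
            for jout in range(Wl // W):
                block = xl[iout * H:(iout + 1) * H, jout * W:(jout + 1) * W, d]
                i, j = np.unravel_index(np.argmax(block), block.shape)
                dz_dx[iout * H + i, jout * W + j, d] += dz_dy[iout, jout, d]
    return dz_dx

xl = np.random.randn(4, 6, 2)
dz_dy = np.ones((2, 3, 2))                 # supervision signal from the (l+1)-th layer
dz_dx = maxpool_backward(xl, dz_dy, 2, 2)
```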

For further readings on CoNNs, the reader is referred to [30].
