
4.2.2 Pixel‐wise Decomposition for Multilayer NN


Pixel‐wise decomposition for multilayer networks: In the previous chapter, we discussed neural networks built as a set of interconnected neurons organized in a layered structure. Combined, they define a mathematical function that maps the first‐layer neurons (input) to the last‐layer neurons (output). In this section, we denote each neuron by x_i, where i is an index for the neuron. By convention, we associate a different index set with each layer of the network. We denote by ∑_i the summation over all neurons of a given layer, and by ∑_j the summation over all neurons of another layer. We denote by x_(d) the neurons corresponding to the pixel activations (i.e., those over which we would like to obtain a decomposition of the classification decision). A common mapping from one layer to the next consists of a linear projection followed by a nonlinear function: z_{ij} = x_i w_{ij}, z_j = ∑_i z_{ij} + b_j, x_j = g(z_j), where w_{ij} is the weight connecting neuron x_i to neuron x_j, b_j is a bias term, and g is a nonlinear activation function. Multilayer networks stack several of these layers, each composed of a large number of neurons. Common nonlinear functions are the hyperbolic tangent g(t) = tanh(t) and the rectification function g(t) = max(0, t).
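To make these definitions concrete, here is a minimal NumPy sketch of one such layer mapping. It is purely illustrative: the layer sizes, the random weights, and the helper name forward_layer are our assumptions, not the book's code; the rectifier is used as the nonlinearity g.

import numpy as np

def forward_layer(x_prev, W, b, g=lambda t: np.maximum(0.0, t)):
    """One layer of the mapping described above.

    x_prev : activations x_i of the lower layer, shape (I,)
    W      : weights w_ij connecting neuron x_i to neuron x_j, shape (I, J)
    b      : bias terms b_j, shape (J,)
    g      : nonlinearity, here the rectifier g(t) = max(0, t)
    """
    z_ij = x_prev[:, None] * W      # local pre-activations z_ij = x_i * w_ij
    z_j = z_ij.sum(axis=0) + b      # z_j = sum_i z_ij + b_j
    x_j = g(z_j)                    # x_j = g(z_j)
    return z_ij, z_j, x_j

# Illustrative layer with four input and three output neurons
rng = np.random.default_rng(0)
x_i = rng.uniform(0.0, 1.0, size=4)
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)
z_ij, z_j, x_j = forward_layer(x_i, W, b)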

Taylor‐type decomposition: Denoting by f : ℝ^M → ℝ^N the vector‐valued multivariate function implementing the mapping between the input and the output of the network, a first possible explanation of the classification decision x ↦ f(x) can be obtained by a Taylor expansion at a near root point x_0 of the decision function f:

f(x) ≈ f(x_0) + ∑_d (∂f/∂x_(d))|_{x = x_0} · (x_(d) − x_{0(d)})    (4.15)

The derivative ∂f(x)/∂x_(d) required for the pixel‐wise decomposition can be computed efficiently by reusing the network topology via the backpropagation algorithm discussed in the previous chapter. Having backpropagated the derivatives up to a certain layer j, we can compute the derivatives of the previous layer i using the chain rule:

∂f/∂x_i = ∑_j (∂f/∂x_j) · (∂x_j/∂x_i) = ∑_j (∂f/∂x_j) · g′(z_j) · w_{ij}    (4.16)
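As a toy illustration of this backward pass, the sketch below (again an assumption: the helper backprop_layer and its conventions are ours, not the book's) applies the chain rule of Eq. (4.16) to push the derivative ∂f/∂x_j of a scalar output one layer down to ∂f/∂x_i, for a layer with a rectifier nonlinearity.

import numpy as np

def backprop_layer(df_dxj, W, z_j, g_prime=lambda t: (t > 0).astype(float)):
    """Chain rule of Eq. (4.16) for one layer with z_j = sum_i x_i w_ij + b_j, x_j = g(z_j).

    df_dxj  : derivatives of f with respect to the upper-layer neurons x_j, shape (J,)
    W       : weights w_ij, shape (I, J)
    z_j     : pre-activations z_j of the upper layer, shape (J,)
    g_prime : derivative of the nonlinearity (here that of the rectifier)
    """
    # df/dx_i = sum_j (df/dx_j) * g'(z_j) * w_ij
    return W @ (df_dxj * g_prime(z_j))

# Repeated calls, starting from the output layer, push the derivative down to the
# input pixels x_(d), yielding the terms of the Taylor-type decomposition (4.15).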


Figure 4.2 Relevance propagation.

Layer‐wise relevance backpropagation: As an alternative to Taylor‐type decomposition, it is possible to compute relevances at each layer in a backward pass, that is, to express the relevances of a layer as a function of the upper‐layer relevances and to backpropagate relevances until we reach the input (pixels). Figure 4.2 depicts a graphic example. The method works as follows: knowing the relevance R_j of a certain neuron x_j for the classification decision f(x), one would like to obtain a decomposition of this relevance in terms of the messages sent to the neurons of the previous layer. We call these messages R_{i←j}. As before, the conservation property

R_j = ∑_i R_{i←j}    (4.17)

must hold. In the case of a linear network f(x) = ∑_i z_{ij}, where the relevance R_j = f(x), such a decomposition is immediately given by R_{i←j} = z_{ij}. However, in the general case, the neuron activation x_j is a nonlinear function of z_j. Nevertheless, for the hyperbolic tangent and the rectifying function – two simple monotonically increasing functions satisfying g(0) = 0 – the pre‐activations z_{ij} still provide a sensible way to measure the relative contribution of each neuron x_i to R_j. A first possible choice of relevance decomposition is based on the ratio of local and global pre‐activations and is given by

R_{i←j} = (z_{ij}/z_j) · R_j    (4.18)

These relevances R_{i←j} are easily shown to approximately satisfy the conservation property; in particular:

∑_i R_{i←j} = (1 − b_j/z_j) · R_j    (4.19)

where the multiplier (1 − b_j/z_j) accounts for the relevance that is absorbed (or injected) by the bias term. If necessary, the residual bias relevance can be redistributed onto each neuron x_i. A drawback of the propagation rule of Eq. (4.18) is that for small values of z_j, the relevances R_{i←j} can take unbounded values. This unboundedness can be overcome by introducing a predefined stabilizer ε ≥ 0:

R_{i←j} = (z_{ij}/(z_j + ε)) · R_j  if z_j ≥ 0,   R_{i←j} = (z_{ij}/(z_j − ε)) · R_j  if z_j < 0    (4.20)

The conservation law then becomes

∑_i R_{i←j} = (1 − (b_j + ε)/(z_j + ε)) · R_j  if z_j ≥ 0,   ∑_i R_{i←j} = (1 − (b_j − ε)/(z_j − ε)) · R_j  if z_j < 0    (4.21)

where we can observe that some further relevance is absorbed by the stabilizer. In particular, relevance is fully absorbed if the stabilizer ε becomes very large.
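A compact illustration of the rules (4.18) and (4.20) might look as follows. The helper lrp_epsilon_layer and the toy data are our assumptions (not the book's code); setting eps = 0 recovers the unstabilized rule (4.18).

import numpy as np

def lrp_epsilon_layer(R_j, z_ij, z_j, eps=0.0):
    """Relevance messages R_{i<-j} of Eqs. (4.18)/(4.20) for a single layer.

    R_j  : relevances of the upper-layer neurons, shape (J,)
    z_ij : local pre-activations z_ij = x_i * w_ij, shape (I, J)
    z_j  : global pre-activations z_j = sum_i z_ij + b_j, shape (J,)
    eps  : stabilizer; eps = 0 recovers the unstabilized rule (4.18)
    """
    denom = z_j + eps * np.where(z_j >= 0, 1.0, -1.0)   # z_j + eps or z_j - eps as in (4.20)
    messages = (z_ij / denom) * R_j                      # messages R_{i<-j}, shape (I, J)
    return messages

# Toy layer, same conventions as the forward-pass sketch above
rng = np.random.default_rng(0)
x_i = rng.uniform(0.0, 1.0, size=4)
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)
z_ij = x_i[:, None] * W
z_j = z_ij.sum(axis=0) + b
R_j = np.maximum(0.0, z_j)                               # toy upper-layer relevances
messages = lrp_epsilon_layer(R_j, z_ij, z_j, eps=0.01)
print(messages.sum(axis=0))                              # close to R_j, up to the terms in (4.21)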

An alternative stabilizing method that does not leak relevance consists of treating the negative and positive pre‐activations separately. Let z_j^+ = ∑_i z_{ij}^+ + b_j^+ and z_j^− = ∑_i z_{ij}^− + b_j^−, where the superscripts − and + denote the negative and positive parts of z_{ij} and b_j, respectively. Relevance propagation is now defined as

R_{i←j} = R_j · (α · z_{ij}^+/z_j^+ + β · z_{ij}^−/z_j^−)    (4.22)

where α + β = 1. For example, for α = β = 1/2, the conservation law becomes

∑_i R_{i←j} = R_j · (1 − (1/2) · (b_j^+/z_j^+ + b_j^−/z_j^−))    (4.23)

which has a form similar to that of Eq. (4.19). This alternative propagation method also allows one to manually control the importance of positive and negative evidence by choosing different factors α and β.
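A corresponding sketch of the αβ‐rule (4.22), under the same assumed conventions (the helper lrp_alphabeta_layer and the small guard constant are ours, not the book's), could be:

import numpy as np

def lrp_alphabeta_layer(R_j, z_ij, b_j, alpha=0.5, beta=0.5, tiny=1e-12):
    """Relevance messages of Eq. (4.22) with alpha + beta = 1.

    R_j  : upper-layer relevances, shape (J,)
    z_ij : local pre-activations x_i * w_ij, shape (I, J)
    b_j  : bias terms, shape (J,)
    tiny : small guard against an empty positive or negative part
    """
    z_ij_pos, z_ij_neg = np.maximum(z_ij, 0.0), np.minimum(z_ij, 0.0)
    z_j_pos = z_ij_pos.sum(axis=0) + np.maximum(b_j, 0.0)   # z_j^+ = sum_i z_ij^+ + b_j^+
    z_j_neg = z_ij_neg.sum(axis=0) + np.minimum(b_j, 0.0)   # z_j^- = sum_i z_ij^- + b_j^-
    messages = R_j * (alpha * z_ij_pos / (z_j_pos + tiny)
                      + beta * z_ij_neg / (z_j_neg - tiny))
    return messages                                          # R_{i<-j}, shape (I, J)

# With alpha = beta = 1/2, summing the messages over i reproduces the form of Eq. (4.23).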

Once a rule for relevance propagation has been selected, the overall relevance of each neuron in the lower layer is determined by summing up the relevances coming from all upper‐layer neurons in agreement with Eqs. (4.8) and (4.9):

R_i = ∑_j R_{i←j}    (4.24)


Figure 4.3 Relevance propagation (heat map; relevance is represented by the intensity of the red color).

Source: Montavon et al. [92].

The relevance is backpropagated from one layer to the next until it reaches the input pixels x_(d), where the relevances provide the desired pixel‐wise decomposition of the decision f(x). A practical example of relevance propagation obtained by Deep Taylor decomposition is shown in Figure 4.3 [92].
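Putting the pieces together, the following self‐contained sketch (purely illustrative; the two‐layer toy network, the choice of the ε‐rule, and all helper names are our assumptions rather than the code of the book or of [92]) propagates relevance from a scalar output back to the input pixels, aggregating the messages layer by layer as in Eq. (4.24).

import numpy as np

def forward(x, layers):
    """Forward pass through rectifier layers, keeping (z_ij, z_j, x_j) of every layer."""
    records, act = [], x
    for W, b in layers:
        z_ij = act[:, None] * W
        z_j = z_ij.sum(axis=0) + b
        act = np.maximum(0.0, z_j)             # x_j = g(z_j) with the rectifier
        records.append((z_ij, z_j, act))
    return records

def lrp(x, layers, eps=1e-2):
    """Backpropagate relevance from the output to the input pixels with the epsilon rule."""
    records = forward(x, layers)
    R = records[-1][2].copy()                  # start from the output relevance R_j = f(x)
    for z_ij, z_j, _ in reversed(records):
        denom = z_j + eps * np.where(z_j >= 0, 1.0, -1.0)
        R = ((z_ij / denom) * R).sum(axis=1)   # R_i = sum_j R_{i<-j}, Eqs. (4.20) and (4.24)
    return R                                   # pixel-wise relevances over x_(d)

# Two-layer toy network acting on a six-"pixel" input
rng = np.random.default_rng(1)
layers = [(rng.normal(size=(6, 4)), rng.normal(size=4)),
          (rng.normal(size=(4, 1)), rng.normal(size=1))]
x = rng.uniform(0.0, 1.0, size=6)
print(lrp(x, layers))                          # pixel-wise decomposition of the decision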

