Artificial Intelligence and Quantum Computing for Advanced Wireless Networks - Savo G. Glisic

4.2.1 Pixel‐wise Decomposition


We start with the concept of pixel‐wise image decomposition, which is designed to understand the contribution of a single pixel of an image x to the prediction f(x) made by a classifier f in an image classification task. We would like to find out, separately for each image x, which pixels contribute to what extent to a positive or negative classification result. In addition, we want to express this extent quantitatively by a measure. We assume that the classifier has real‐valued outputs with mapping $f: \mathbb{R}^V \to \mathbb{R}^1$ such that f(x) > 0 denotes the presence of the learned structure. We are interested in finding out the contribution of each input pixel $x_d$ of an input image x to a particular prediction f(x). The important constraint specific to classification is to find the differential contribution relative to the state of maximal uncertainty with respect to classification, which is represented by the set of root points f(x0) = 0. One possible way is to decompose the prediction f(x) as a sum of terms over the separate input dimensions $x_d$:

$f(x) \approx \sum_{d=1}^{V} R_d$ (4.1)

Here, the qualitative interpretation is that Rd < 0 contributes evidence against the presence of a structure that is to be classified, whereas Rd > 0 contributes evidence for its presence. More generally, positive values should denote positive contributions and negative values, negative contributions.

LRP: Returning to multilayer ANNs, we will introduce layer‐wise relevance propagation (LRP) as a concept defined by a set of constraints. In its general form, the concept assumes that the classifier can be decomposed into several layers of computation, which is a structure used in deep NNs. The first layer contains the inputs, the pixels of the image, and the last layer is the real‐valued prediction output of the classifier f. The l‐th layer is modeled as a vector z with dimensionality V(l). LRP assumes that we have a relevance score $R_d^{(l+1)}$ for each dimension $z_d^{(l+1)}$ of the vector z at layer l + 1. The idea is to find a relevance score $R_d^{(l)}$ for each dimension $z_d^{(l)}$ of the vector z at the preceding layer l, which is closer to the input layer, such that the following equation holds:

$f(x) = \cdots = \sum_{d \in l+1} R_d^{(l+1)} = \sum_{d \in l} R_d^{(l)} = \cdots = \sum_{d} R_d^{(1)}$ (4.2)

Iterating Eq. (4.2) from the last layer, which is the classifier output f(x), back to the input layer x consisting of the image pixels then yields the desired Eq. (4.1). The relevance of the input layer will serve as the desired sum decomposition in Eq. (4.1). In the following, we will derive further constraints beyond Eqs. (4.1) and (4.2) and motivate them by examples. A decomposition satisfying Eq. (4.2) is per se neither unique, nor is it guaranteed to yield a meaningful interpretation of the classifier prediction.

As an example, suppose we have one layer. The inputs are $x \in \mathbb{R}^V$. We use a linear classifier with some arbitrary and dimension‐specific feature space mapping $\varphi_d$, dimension‐specific weights $\alpha_d$, and a bias b:

$f(x) = b + \sum_{d=1}^{V} \alpha_d \varphi_d(x_d)$ (4.3)

Let us define the relevance for the second layer trivially as $R_1^{(2)} = f(x)$. Then, one possible LRP formula would be to define the relevance $R^{(1)}$ for the inputs x as

$R_d^{(1)} = \frac{f(x)}{V}$ (4.4)

This clearly satisfies Eqs. (4.1) and (4.2); however, the relevances $R^{(1)}(x_d)$ of all input dimensions have the same sign as the prediction f(x). In terms of the pixel‐wise decomposition interpretation, all inputs point toward the presence of a structure if f(x) > 0 and toward its absence if f(x) < 0. For many classification problems, this is not a realistic interpretation. As a solution, for this example we define an alternative

$R_d^{(1)} = \alpha_d \varphi_d(x_d) + \frac{b}{V}$ (4.5)

Then, the relevance of a feature dimension $x_d$ depends on the sign of the corresponding term in Eq. (4.5). For many classification problems, this is a more plausible interpretation. This second example shows that LRP is able to deal with nonlinearities such as the feature space mapping $\varphi_d$ to some extent, and what an example of LRP satisfying Eq. (4.2) may look like in practice.
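This one‐layer example can be checked numerically. The sketch below is a minimal illustration under assumptions not in the text: a hypothetical feature map $\varphi_d = \tanh$ and randomly drawn weights $\alpha_d$. It verifies that both Eq. (4.4) and Eq. (4.5) conserve the prediction as required by Eq. (4.1), while only Eq. (4.5) can yield relevances of either sign.

```python
# Sketch of the one-layer example of Eqs. (4.3)-(4.5); the feature map
# phi_d = tanh and the random weights alpha_d are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)
V = 4                                  # number of input dimensions
alpha = rng.normal(size=V)             # dimension-specific weights alpha_d
b = 0.5                                # bias
x = rng.normal(size=V)                 # one input sample
phi = np.tanh                          # arbitrary dimension-wise feature map

f_x = b + np.sum(alpha * phi(x))       # Eq. (4.3)

R_naive = np.full(V, f_x / V)          # Eq. (4.4): every R_d shares f(x)'s sign
R_alt = alpha * phi(x) + b / V         # Eq. (4.5): sign can vary per dimension

# Both decompositions conserve the prediction, as required by Eq. (4.1):
assert np.isclose(R_naive.sum(), f_x)
assert np.isclose(R_alt.sum(), f_x)
```

Both candidate decompositions sum to f(x); the difference lies only in how the evidence is attributed across dimensions.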

The above example gives an intuition about what relevance R is, namely, the local contribution to the prediction function f(x). In that sense, the relevance of the output layer is the prediction itself: f(x). This first example shows what one could expect as a decomposition for the linear case. The linear case is not a novelty; however, it provides a first intuition. A more graphic and nonlinear example is given in Figure 4.1. The upper part of the figure shows a neural‐network‐shaped classifier with neurons and weights $w_{ij}$ on connections between neurons. Each neuron i has an output $a_i$ from an activation function. The top layer consists of one output neuron, indexed by 7. For each neuron i we would like to compute a relevance $R_i$. We initialize the top‐layer relevance as the function value; thus, $R_7^{(3)} = f(x)$. LRP in Eq. (4.2) now requires the following to hold:

$f(x) = R_7^{(3)} = R_4^{(2)} + R_5^{(2)} + R_6^{(2)}$ (4.6)

$f(x) = R_1^{(1)} + R_2^{(1)} + R_3^{(1)}$ (4.7)

We will make two assumptions for this example. First, we express the layer‐wise relevance in terms of messages $R_{i \leftarrow k}^{(l,l+1)}$ between a neuron k and its input neuron i, which can be sent along each connection. The messages are, however, directed from a neuron toward its input neurons, in contrast to what happens at prediction time, as shown in the lower part of Figure 4.1. Second, we define the relevance of any neuron except neuron 7 as the sum of incoming messages (over all k such that i is an input for neuron k):

$R_i^{(l)} = \sum_{k:\, i \text{ is input for neuron } k} R_{i \leftarrow k}^{(l,l+1)}$ (4.8)


Figure 4.1 (a) Neural network (NN) as a classifier, (b) NN during the relevance computation.

For example, $R_2^{(1)}$ is the sum of the messages $R_{2 \leftarrow k}^{(1,2)}$ over all neurons k that neuron 2 feeds, such as $R_{2 \leftarrow 4}^{(1,2)}$ and $R_{2 \leftarrow 6}^{(1,2)}$. Note that neuron 7 has no incoming messages anyway. Instead, its relevance is defined as $R_7^{(3)} = f(x)$. In Eq. (4.8) and the following text, the terms input and source have the meaning of being an input to another neuron in the direction defined during the time of classification, not during the time of computation of LRP. For example, in Figure 4.1, neurons 1 and 2 are the inputs and sources for neuron 4, while neuron 6 is the sink for neurons 2 and 3. Given the two assumptions encoded in Eq. (4.8), the LRP of Eq. (4.2) can be satisfied by the following sufficient condition:

$R_4^{(2)} = R_{1 \leftarrow 4}^{(1,2)} + R_{2 \leftarrow 4}^{(1,2)}$ and $R_6^{(2)} = R_{2 \leftarrow 6}^{(1,2)} + R_{3 \leftarrow 6}^{(1,2)}$. In general, this condition can be expressed as

$R_k^{(l+1)} = \sum_{i:\, i \text{ is input for neuron } k} R_{i \leftarrow k}^{(l,l+1)}$ (4.9)

The difference between condition (4.9) and definition (4.8) is that in condition (4.9) the sum runs over the sources i at layer l for a fixed neuron k at layer l + 1, whereas in definition (4.8) the sum runs over the sinks k at layer l + 1 for a fixed neuron i at layer l. When Eq. (4.8) is used to define the relevance of a neuron from its messages, condition (4.9) is a sufficient condition in order to ensure that Eq. (4.2) holds. Summing over the left‐hand side of Eq. (4.9) yields

$\sum_{k} R_k^{(l+1)} = \sum_{k} \sum_{i:\, i \text{ is input for } k} R_{i \leftarrow k}^{(l,l+1)} = \sum_{i} \sum_{k:\, i \text{ is input for } k} R_{i \leftarrow k}^{(l,l+1)} = \sum_{i} R_i^{(l)},$

which is exactly Eq. (4.2).
One can interpret condition (4.9) by saying that the messages are used to distribute the relevance of a neuron k onto its input neurons at layer l. In the following sections, we will use this notion and the stricter form of relevance conservation given by definition (4.8) and condition (4.9). We set Eqs. (4.8) and (4.9) as the main constraints defining LRP. A solution following this concept is required to define the messages according to these equations.

Now we can derive an explicit formula for LRP for our example by defining the messages $R_{i \leftarrow k}^{(l,l+1)}$. The LRP should reflect the messages passed during classification time. We know that during classification time, a neuron i inputs $a_i w_{ik}$ to neuron k, provided that i has a forward connection to k. Thus, we can rewrite the expressions for $R_7^{(3)}$ and for $R_k^{(2)}$ so that they match the structure of the right‐hand sides of Eqs. (4.6) and (4.7) as follows:

$R_7^{(3)} = \sum_{i=4}^{6} R_7^{(3)} \frac{a_i w_{i7}}{a_4 w_{47} + a_5 w_{57} + a_6 w_{67}}$ (4.10)

$R_k^{(2)} = \sum_{i:\, i \text{ is input for } k} R_k^{(2)} \frac{a_i w_{ik}}{\sum_{h:\, h \text{ is input for } k} a_h w_{hk}}, \quad k \in \{4, 5, 6\}$ (4.11)

The match of the right‐hand sides of the initial expressions for $R_7^{(3)}$ and $R_k^{(2)}$ against the right‐hand sides of Eqs. (4.10) and (4.11) can be expressed in general as

$R_{i \leftarrow k}^{(l,l+1)} = R_k^{(l+1)} \frac{a_i w_{ik}}{\sum_{h:\, h \text{ is input for } k} a_h w_{hk}}$ (4.12)

Although this solution, Eq. (4.12), for the message terms still needs to be adapted so that it remains usable when the denominator becomes zero, the example given in Eq. (4.12) gives an idea of what a message could be, namely, the relevance of a sink neuron that has already been computed, weighted proportionally by the input $a_i w_{ik}$ that neuron i contributes from the preceding layer l.
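As an illustration, the following sketch applies the message rule of Eq. (4.12) together with the summation of Eq. (4.8) to a small two‐layer network and checks the layer‐wise conservation of Eq. (4.2). The network shape, weights, and the small ε guarding a near‐zero denominator are all assumptions for this sketch, not prescriptions from the text.

```python
# Minimal LRP sketch for Eqs. (4.8) and (4.12); the network and weights are
# made up, and eps is one simple way to guard a near-zero denominator.
import numpy as np

rng = np.random.default_rng(1)
eps = 1e-9

# Forward pass of a tiny 3-3-1 network (inputs a0, hidden a1, output f(x)).
a0 = rng.uniform(0.1, 1.0, size=3)       # input activations
W1 = rng.uniform(0.1, 1.0, size=(3, 3))  # positive weights w_ik, layer 1 -> 2
a1 = W1.T @ a0                           # linear hidden units (an assumption)
w2 = rng.normal(size=3)                  # signed weights into the output neuron
f_x = float(a1 @ w2)                     # prediction f(x)

def lrp_step(a, W, R_next):
    """Distribute R_next over its inputs: Eq. (4.12) summed as in Eq. (4.8)."""
    z = a[:, None] * W                   # contributions z_ik = a_i * w_ik
    denom = z.sum(axis=0)                # sum_h a_h * w_hk per sink neuron k
    denom = np.where(denom >= 0, denom + eps, denom - eps)
    return (z / denom) @ R_next          # R_i = sum_k R_{i<-k}

R_out = np.array([f_x])                  # top-layer relevance, R = f(x)
R_hidden = lrp_step(a1, w2[:, None], R_out)
R_input = lrp_step(a0, W1, R_hidden)

# Layer-wise conservation, Eq. (4.2): each layer sums (approximately) to f(x).
assert np.isclose(R_hidden.sum(), f_x, atol=1e-6)
assert np.isclose(R_input.sum(), f_x, atol=1e-6)
```

The hidden and input relevances each sum to f(x), mirroring the conservation argument above; the ε term slightly perturbs exact conservation, which is why the checks are approximate.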

Taylor‐type decomposition: An alternative approach for achieving a decomposition as in Eq. (4.1) for a general differentiable predictor f is a first‐order Taylor approximation:

$f(x) \approx f(x_0) + \sum_{d=1}^{V} \frac{\partial f}{\partial x_d}(x_0)\,(x_d - x_{0,d})$ (4.13)

The choice of the Taylor base point $x_0$ is a free parameter in this setup. As stated above, in the case of classification, we are interested in the contribution of each pixel relative to the state of maximal uncertainty of the prediction, given by the set of points f(x0) = 0, since f(x) > 0 denotes the presence and f(x) < 0 the absence of the learned structure. Thus, $x_0$ should be chosen to be a root of the predictor f, and the above equation simplifies to

$f(x) \approx \sum_{d=1}^{V} \frac{\partial f}{\partial x_d}(x_0)\,(x_d - x_{0,d}) = \sum_{d=1}^{V} R_d$ (4.14)

The pixel‐wise decomposition contains a nonlinear dependence on the prediction point x beyond the Taylor series, as a close root point $x_0$ needs to be found. Thus, the whole pixel‐wise decomposition is not a linear but a locally linear algorithm, since the root point $x_0$ depends on the prediction point x.
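A minimal numerical sketch of this Taylor‐type decomposition follows, under assumptions not in the text: a made‐up two‐dimensional predictor, a finite‐difference gradient, and a root point $x_0$ found by bisection along the segment from the origin to x.

```python
# Sketch of the Taylor-type decomposition in Eq. (4.14). The predictor f,
# the sample x, and the bisection root search are all made-up assumptions.
import numpy as np

def f(x):
    # Hypothetical differentiable predictor taking both signs in its range.
    return np.tanh(x[0]) + 0.5 * x[1] - 0.2

def grad_f(x, h=1e-6):
    # Central-difference approximation of the gradient of f at x.
    g = np.zeros_like(x)
    for d in range(len(x)):
        e = np.zeros_like(x)
        e[d] = h
        g[d] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

x = np.array([0.8, 0.4])                  # prediction point, f(x) > 0 here

# Find a root point x0 with f(x0) ~= 0 by bisection on the segment from the
# origin (where f < 0 here) to x; this assumes f changes sign on the segment.
lo, hi = np.zeros_like(x), x.copy()
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if np.sign(f(mid)) == np.sign(f(lo)):
        lo = mid
    else:
        hi = mid
x0 = 0.5 * (lo + hi)

# Eq. (4.14): R_d = (df/dx_d)(x0) * (x_d - x0_d); the sum matches f(x)
# only up to the first-order Taylor remainder, since f is nonlinear.
R = grad_f(x0) * (x - x0)
print(f(x), R.sum())
```

Because the root point is recomputed for each prediction point x, the resulting attribution is locally linear rather than linear, exactly as noted above.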

