Machine Learning for Tomographic Imaging – Professor Ge Wang

3.1.7 Backpropagation algorithm


For a multi-layer neural network, the initial weights are randomly initialized. As such, the network does not immediately perform well on data. We need to iteratively adjust these weights based on the error between the training labels and the values predicted by the network, so as to obtain a converged/trained neural network with an appropriate internal representation, which can then map an input to a desirable output. Such a training process is almost exclusively done using the backpropagation algorithm (Rumelhart et al 1986).

Backpropagation is short for ‘error backward propagation’, and is essentially gradient descent optimization. In backpropagation, the error or loss is calculated at the output of the network and then distributed backwards through the network layers. How does backpropagation work exactly? Let us explain it now, illustrating the backpropagation process with a simple neural network.

Backpropagation is achieved by iteratively calculating the gradient of the loss function layer by layer using the chain rule. To illustrate the principle of backpropagation, we rewrite the loss function of equation (3.14) as follows:

$$J(W,b)=\frac{1}{n}\sum_{i=1}^{n}L\!\left(W,b;x^{(i)},y^{(i)}\right).\tag{3.25}$$

Compared to equation (3.14), the loss function equation (3.25) is a more specific formula that contains all parameters to be trained in the neural network, as well as training data and labels. Each gradient descent iteration updates the parameters W, b in the matrix/vector form as follows:

$$w_{ij}^{(l+1)}=w_{ij}^{(l)}-\alpha\frac{\partial}{\partial w_{ij}^{(l)}}J(W,b)\tag{3.26}$$

$$b_{i}^{(l+1)}=b_{i}^{(l)}-\alpha\frac{\partial}{\partial b_{i}^{(l)}}J(W,b),\tag{3.27}$$

where α is the learning rate, w_ij^(l) denotes the weight between neuron i in layer l and neuron j in layer (l−1), and b_i^(l) denotes the bias of neuron i in layer l. The key step is to compute the above partial derivatives. The following paragraphs describe the backpropagation algorithm, which is an efficient way to compute these partial derivatives.
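As a minimal illustration (not from the book), the update rules (3.26)–(3.27) can be sketched in Python for a single weight and bias; the gradient values below are hypothetical placeholders for what backpropagation would compute:

```python
# Minimal sketch of the gradient descent updates (3.26)-(3.27).
# grad_w and grad_b stand in for dJ/dw and dJ/db, which backpropagation
# computes; here they are hypothetical numbers for illustration.

def gradient_step(w, b, grad_w, grad_b, alpha=0.1):
    """One update of a single weight w and bias b with learning rate alpha."""
    w_new = w - alpha * grad_w   # equation (3.26)
    b_new = b - alpha * grad_b   # equation (3.27)
    return w_new, b_new

w, b = 0.5, 0.0
w, b = gradient_step(w, b, grad_w=0.2, grad_b=-0.1, alpha=0.1)
print(w, b)  # w decreases (positive gradient), b increases (negative gradient)
```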

As an example, consider a neural network that consists of two input units, one output unit, and no hidden units, in which each neuron simply outputs the weighted sum of its inputs, as shown in figure 3.15.


Figure 3.15. Toy neural network with two input units and one output unit.

For such a simple neural network, the loss function is as follows:

$$L=\frac{1}{2}\left(y-\hat{y}\right)^{2},\tag{3.28}$$

where L is the squared error, y is the target output for a training sample, and ŷ is the actual output.

With the sigmoid function as the activation function, for each neuron j, its output o_j is equal to

$$o_j=\sigma(s_j)\tag{3.29}$$

$$s_j=\sum_{k=1}^{n}w_{kj}o_k,\tag{3.30}$$

where s_j, the weighted sum formed inside neuron j, is in general the weighted sum of the outputs o_k of the neurons in the previous layer; for a neuron in the first layer after the input layer, s_j is the weighted sum of the input data x_k, k = 1, 2. The number of input variables to the neuron is denoted by n, and σ is the sigmoid function defined by equation (3.3).
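Equations (3.29)–(3.30) amount to a weighted sum followed by the sigmoid. A minimal sketch for a neuron with two inputs, in the spirit of figure 3.15 (the weights and inputs below are made-up numbers):

```python
import math

def sigmoid(x):
    # equation (3.3): sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs):
    # s_j = sum_k w_kj * o_k   (equation (3.30))
    s = sum(w * o for w, o in zip(weights, inputs))
    # o_j = sigma(s_j)         (equation (3.29))
    return sigmoid(s)

# Hypothetical neuron with two inputs:
print(neuron_output([0.4, -0.6], [1.0, 2.0]))  # sigmoid(0.4 - 1.2) = sigmoid(-0.8)
```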

The activation function σ is nonlinear and differentiable; its derivative is as follows:

$$\frac{d\sigma}{dx}(x)=\sigma(x)\bigl(1-\sigma(x)\bigr).\tag{3.31}$$
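The identity (3.31) is easy to check numerically; this sketch compares σ(x)(1 − σ(x)) against a central finite-difference estimate at an arbitrary point:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # equation (3.31): sigma'(x) = sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1.0 - sigmoid(x))

# Central finite-difference check at an arbitrary point x = 0.7:
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numeric - sigmoid_prime(x)) < 1e-8)  # True
```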

Then, the partial derivative of the error with respect to the weight w_ij is calculated using the chain rule:

$$\frac{\partial L}{\partial w_{ij}}=\frac{\partial L}{\partial o_j}\,\frac{\partial o_j}{\partial s_j}\,\frac{\partial s_j}{\partial w_{ij}}.\tag{3.32}$$

Since in the last factor on the right-hand side only one term in the sum s_j depends on w_ij, we have

$$\frac{\partial s_j}{\partial w_{ij}}=\frac{\partial}{\partial w_{ij}}\sum_{k=1}^{n}w_{kj}o_k=\frac{\partial}{\partial w_{ij}}\,w_{ij}o_i=o_i.\tag{3.33}$$

As just mentioned, if the neuron is in the first layer after the input layer, o_i is the same as x_i. The derivative of the output of neuron j with respect to its input is simply the derivative of the activation function:

$$\frac{\partial o_j}{\partial s_j}=\frac{\partial}{\partial s_j}\sigma(s_j)=\sigma(s_j)\bigl(1-\sigma(s_j)\bigr).\tag{3.34}$$

In equation (3.32), the first factor is easy to evaluate when the neuron is in the output layer, because in that case o_j = ŷ and

$$\frac{\partial L}{\partial o_j}=\frac{\partial L}{\partial \hat{y}}=\frac{\partial}{\partial \hat{y}}\,\frac{1}{2}\left(y-\hat{y}\right)^{2}=\hat{y}-y.\tag{3.35}$$
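For the squared error (3.28), the gradient (3.35) can be verified against a finite difference; the values of y and ŷ below are arbitrary:

```python
def loss(y, y_hat):
    # equation (3.28): L = (1/2) * (y - y_hat)^2
    return 0.5 * (y - y_hat) ** 2

def dL_dyhat(y, y_hat):
    # equation (3.35): dL/dy_hat = y_hat - y
    return y_hat - y

# Finite-difference check with arbitrary values y = 1.0, y_hat = 0.3:
y, y_hat, h = 1.0, 0.3, 1e-6
numeric = (loss(y, y_hat + h) - loss(y, y_hat - h)) / (2 * h)
print(round(numeric, 6), dL_dyhat(y, y_hat))  # both about -0.7
```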

However, if j is in a hidden layer of the network, finding the derivative of L with respect to o_j is less obvious. Let M = {u, v, …, w} be the set of indices of all neurons that receive input from neuron j; then we have

$$\frac{\partial L(o_j)}{\partial o_j}=\frac{\partial L(s_u,s_v,\ldots,s_w)}{\partial o_j}.\tag{3.36}$$

Then, taking the total derivative with respect to o_j, a recursive expression for the derivative is obtained as follows:

$$\frac{\partial L}{\partial o_j}=\sum_{m\in M}\frac{\partial L}{\partial s_m}\,\frac{\partial s_m}{\partial o_j}=\sum_{m\in M}\frac{\partial L}{\partial o_m}\,\frac{\partial o_m}{\partial s_m}\,w_{jm}.\tag{3.37}$$

Therefore, the derivative with respect to o_j can be calculated once all the derivatives with respect to the outputs o_m of the next layer are known, starting from the neurons in the output layer. The overall computational process can be expressed as follows:

$$\frac{\partial L}{\partial w_{ij}}=\delta_j o_i,\qquad \delta_j=\frac{\partial L}{\partial o_j}\,\frac{\partial o_j}{\partial s_j}=\begin{cases}(o_j-y_j)\,o_j(1-o_j) & \text{if } j \text{ is an output neuron}\\[4pt] \left(\displaystyle\sum_{m\in M}w_{jm}\delta_m\right)o_j(1-o_j) & \text{if } j \text{ is a hidden neuron.}\end{cases}\tag{3.38}$$
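Equation (3.38) reduces to two small computations, one per case. In this sketch the network shape (one hidden neuron feeding two output neurons) and all activations, targets, and weights are hypothetical numbers:

```python
def delta_output(o_j, y_j):
    # Output neuron: delta_j = (o_j - y_j) * o_j * (1 - o_j)
    return (o_j - y_j) * o_j * (1.0 - o_j)

def delta_hidden(o_j, w_jm, delta_m):
    # Hidden neuron: delta_j = (sum_m w_jm * delta_m) * o_j * (1 - o_j)
    back = sum(w * d for w, d in zip(w_jm, delta_m))
    return back * o_j * (1.0 - o_j)

# Hypothetical activations: one hidden neuron (output 0.5) feeds two
# output neurons (outputs 0.8 and 0.6, targets 1.0 and 0.0).
d_out = [delta_output(0.8, 1.0), delta_output(0.6, 0.0)]
d_hid = delta_hidden(0.5, [0.3, -0.2], d_out)
print(d_hid)
# The gradient for a weight into neuron j is then dL/dw_ij = delta_j * o_i.
```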

In the gradient descent method, the adjustment of w_ij can be calculated from the partial derivatives expressed in equation (3.38). In the (n + 1)th iteration, w_ij is updated as follows:

$$w_{ij}^{(n+1)}=w_{ij}^{(n)}+\Delta w_{ij}^{(n)}.\tag{3.39}$$

The change in weights reflects how much L will be changed by adjusting w_ij. If ∂L/∂w_ij > 0, an increase in w_ij leads to an increased L value. On the other hand, if ∂L/∂w_ij < 0, an increase in w_ij means a decrease in L. Clearly, to specify the updating speed or learning rate, a coefficient η is needed:

$$\Delta w_{ij}=-\eta\frac{\partial L}{\partial w_{ij}}=-\eta\,\delta_j o_i.\tag{3.40}$$

One should choose a learning rate η > 0. Multiplying the gradient by a proper learning rate and by −1 guarantees that w_ij is refined in a way that decreases the loss function. In other words, changing w_ij by −η ∂L/∂w_ij reduces the value of L. For more details on the backpropagation algorithm, see Rumelhart et al (1986).
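Putting the pieces together, repeated application of equations (3.28)–(3.30), (3.38), and (3.40) to a single sigmoid neuron with two inputs (a sigmoid variant of the toy network in figure 3.15, with made-up data and an assumed learning rate) drives the loss down:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up training sample and initial weights for a single sigmoid neuron.
x = [1.0, 0.5]          # inputs x_1, x_2
y = 1.0                 # target output
w = [0.1, -0.3]         # initial weights (arbitrary)
eta = 0.5               # assumed learning rate

losses = []
for _ in range(50):
    s = sum(wk * xk for wk, xk in zip(w, x))             # equation (3.30)
    o = sigmoid(s)                                       # equation (3.29)
    losses.append(0.5 * (y - o) ** 2)                    # equation (3.28)
    delta = (o - y) * o * (1.0 - o)                      # output delta, equation (3.38)
    w = [wk - eta * delta * xk for wk, xk in zip(w, x)]  # equation (3.40)

print(losses[0] > losses[-1])  # True: the loss decreases over the iterations
```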

