3.2.3 Adaptation
For supervised learning with input sequence x(k), the difference between the desired output at time k and the actual output of the network is the error

$$e_j(k) = d_j(k) - y_j(k) \tag{3.17}$$
The total squared error over the sequence is given by

$$J = \sum_{k} e^{2}(k) = \sum_{k}\sum_{j} e_j^{2}(k) \tag{3.18}$$
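As a quick numerical illustration of Eqs. (3.17) and (3.18), the following minimal NumPy sketch (the sequences are made up and a single output node is assumed) accumulates the instantaneous squared errors over a short sequence:

```python
import numpy as np

# Hypothetical desired and actual outputs for one output node over five time steps.
d = np.array([1.0, 0.5, -0.2, 0.8, 0.3])   # desired outputs d(k)
y = np.array([0.9, 0.4, -0.1, 0.7, 0.5])   # network outputs y(k)

e = d - y                  # instantaneous errors e(k), Eq. (3.17)
J = np.sum(e ** 2)         # total squared error over the sequence, Eq. (3.18)
print(e, J)                # J is approximately 0.08 for these values
```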
The objective of training is to determine the set of FIR filter coefficients (weights) that minimizes the cost J subject to the constraint of the network topology. A gradient descent approach will be utilized again in which the weights are iteratively updated.
For instantaneous gradient descent, the FIR filters may be updated at each time slot as

$$\mathbf{w}_{ij}^{l}(k+1) = \mathbf{w}_{ij}^{l}(k) - \mu\,\frac{\partial e^{2}(k)}{\partial \mathbf{w}_{ij}^{l}} \tag{3.19}$$

where $\partial e^{2}(k)/\partial \mathbf{w}_{ij}^{l}$ is the instantaneous gradient estimate and μ is the learning rate. However, expanding this gradient for weights below the output layer produces a tangle of overlapping chain rules across layers and time. A simple backpropagation-like formulation does not exist anymore.
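To make the difficulty concrete, consider as an illustration (this expansion is not one of the book's numbered equations) a network with one hidden layer, inputs $a_i^{0}(k) = x_i(k)$, hidden pre-activations $s_j^{1}(k)$, output pre-activations $s_m^{2}(k)$, activation function f, and FIR coefficients $w_{ij}^{0}(n)$ and $w_{jm}^{1}(n)$. Expanding the instantaneous gradient with respect to a first-layer coefficient yields

$$\frac{\partial e^{2}(k)}{\partial w_{ij}^{0}(n)} = \sum_{m}\Big(-2\,e_m(k)\,f'\!\big(s_m^{2}(k)\big)\Big)\sum_{n'} w_{jm}^{1}(n')\,f'\!\big(s_j^{1}(k-n')\big)\,a_i^{0}\big(k-n-n'\big),$$

so already for two layers each coefficient drags in activations at many past time instants, and every additional layer nests another sum over filter taps.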
Temporal backpropagation is an alternative approach that can be used to avoid the above problem. To discuss it, let us consider two alternative forms of the true gradient of the cost function:
$$\frac{\partial J}{\partial \mathbf{w}_{ij}^{l}} = \sum_{k}\frac{\partial e^{2}(k)}{\partial \mathbf{w}_{ij}^{l}} = \sum_{k}\frac{\partial J}{\partial s_j^{l+1}(k)}\cdot\frac{\partial s_j^{l+1}(k)}{\partial \mathbf{w}_{ij}^{l}} \tag{3.20}$$

Note that

$$\frac{\partial e^{2}(k)}{\partial \mathbf{w}_{ij}^{l}} \neq \frac{\partial J}{\partial s_j^{l+1}(k)}\cdot\frac{\partial s_j^{l+1}(k)}{\partial \mathbf{w}_{ij}^{l}} \quad\text{for each individual } k;$$

only their sum over all k is equal. Based on this new expansion, each term in the sum is used to form the following stochastic algorithm:

$$\mathbf{w}_{ij}^{l}(k+1) = \mathbf{w}_{ij}^{l}(k) - \mu\,\frac{\partial J}{\partial s_j^{l+1}(k)}\cdot\frac{\partial s_j^{l+1}(k)}{\partial \mathbf{w}_{ij}^{l}} \tag{3.21}$$
For small learning rates, the total accumulated weight change is approximately equal to the true gradient. This training algorithm is termed temporal backpropagation.
To complete the algorithm, recall that the summing junction is defined as

$$s_j^{l+1}(k) = \sum_{i}\mathbf{w}_{ij}^{l}\cdot\mathbf{a}_i^{l}(k), \qquad a_j^{l+1}(k) = f\!\left(s_j^{l+1}(k)\right) \tag{3.22}$$

where the intermediate variables $s_j^{l+1}(k)$ are defined for convenience; here $\mathbf{w}_{ij}^{l} = \left[w_{ij}^{l}(0),\,w_{ij}^{l}(1),\,\ldots,\,w_{ij}^{l}(T)\right]$ is the vector of FIR coefficients of the synapse connecting neuron i in layer l to neuron j in layer l + 1, and $\mathbf{a}_i^{l}(k) = \left[a_i^{l}(k),\,a_i^{l}(k-1),\,\ldots,\,a_i^{l}(k-T)\right]^{\mathsf T}$ is the corresponding tap-delay-line state vector. The partial derivative in Eq. (3.21) is easily evaluated as

$$\frac{\partial s_j^{l+1}(k)}{\partial \mathbf{w}_{ij}^{l}} = \mathbf{a}_i^{l}(k) \tag{3.23}$$
This holds for all layers in the network. Defining

$$\delta_j^{l+1}(k) \equiv \frac{\partial J}{\partial s_j^{l+1}(k)}$$

allows us to rewrite Eq. (3.21) as

$$\mathbf{w}_{ij}^{l}(k+1) = \mathbf{w}_{ij}^{l}(k) - \mu\,\delta_j^{l+1}(k)\,\mathbf{a}_i^{l}(k) \tag{3.24}$$
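The summing junction of Eq. (3.22) and the update of Eq. (3.24) translate directly into code. The following minimal NumPy sketch is illustrative only; the array shapes, the tanh nonlinearity, and the function names are assumptions rather than the book's notation:

```python
import numpy as np

def fir_layer_forward(W, a_taps):
    """Summing junction of Eq. (3.22) for one layer.

    W      : (n_out, n_in, T+1) FIR coefficients w_ij^l(n)
    a_taps : (n_in, T+1) tap-delay states [a_i^l(k), ..., a_i^l(k-T)]
    Returns the pre-activations s_j^{l+1}(k) and activations a_j^{l+1}(k).
    """
    s = np.einsum('jin,in->j', W, a_taps)   # s_j = sum_i w_ij . a_i(k)
    return s, np.tanh(s)                    # tanh assumed as the activation f

def fir_weight_update(W, delta_next, a_taps, mu):
    """Coefficient update of Eq. (3.24): w_ij <- w_ij - mu * delta_j^{l+1}(k) * a_i^l(k)."""
    return W - mu * np.einsum('j,in->jin', delta_next, a_taps)
```

Each synapse is thus an ordinary FIR filter, and the update adjusts all of its taps at once using the current contents of the delay line.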
We now show that a simple recursive formula exists for finding $\delta_j^{l}(k)$. Starting with the output layer (l = L), we observe that $s_j^{L}(k)$ influences only the instantaneous output node error e_j(k). Thus, we have

$$\delta_j^{L}(k) = \frac{\partial J}{\partial s_j^{L}(k)} = \frac{\partial e_j^{2}(k)}{\partial s_j^{L}(k)} = -2\,e_j(k)\,f'\!\left(s_j^{L}(k)\right) \tag{3.25}$$
For a hidden layer, $s_j^{l}(k)$ has an impact on the error indirectly through all node values in the subsequent layer. Due to the tap delay lines, $s_j^{l}(k)$ also has an impact on the error across time. Therefore, the chain rule now becomes

$$\delta_j^{l}(k) = \frac{\partial J}{\partial s_j^{l}(k)} = \sum_{m}\sum_{t}\frac{\partial J}{\partial s_m^{l+1}(t)}\cdot\frac{\partial s_m^{l+1}(t)}{\partial s_j^{l}(k)} = \sum_{m}\sum_{t}\delta_m^{l+1}(t)\,\frac{\partial s_m^{l+1}(t)}{\partial s_j^{l}(k)} \tag{3.26}$$

where by definition $\delta_m^{l+1}(t) = \partial J/\partial s_m^{l+1}(t)$. Continuing with the remaining term,

$$\frac{\partial s_m^{l+1}(t)}{\partial s_j^{l}(k)} = \frac{\partial s_m^{l+1}(t)}{\partial a_j^{l}(k)}\cdot\frac{\partial a_j^{l}(k)}{\partial s_j^{l}(k)} = \frac{\partial s_m^{l+1}(t)}{\partial a_j^{l}(k)}\,f'\!\left(s_j^{l}(k)\right) \tag{3.27}$$
Now

$$\frac{\partial s_m^{l+1}(t)}{\partial a_j^{l}(k)} = \frac{\partial\left[\mathbf{w}_{jm}^{l}\cdot\mathbf{a}_j^{l}(t)\right]}{\partial a_j^{l}(k)} \tag{3.27a}$$

since the only influence $a_j^{l}(k)$ has on $s_m^{l+1}(t)$ is via the synapse connecting unit j in layer l to unit m in layer l + 1. The definition of the synapse is explicitly given as

$$\mathbf{w}_{jm}^{l}\cdot\mathbf{a}_j^{l}(t) = \sum_{n=0}^{T} w_{jm}^{l}(n)\,a_j^{l}(t-n) \tag{3.28}$$
Thus

$$\frac{\partial s_m^{l+1}(t)}{\partial a_j^{l}(k)} = \begin{cases} w_{jm}^{l}(t-k), & 0 \le t-k \le T \\ 0, & \text{otherwise} \end{cases} \tag{3.29}$$

and

$$\sum_{t}\delta_m^{l+1}(t)\,\frac{\partial s_m^{l+1}(t)}{\partial a_j^{l}(k)} = \sum_{n=0}^{T}\delta_m^{l+1}(k+n)\,w_{jm}^{l}(n) \tag{3.30}$$
Making all substitutions into Eq. (3.26), we get

$$\delta_j^{l}(k) = f'\!\left(s_j^{l}(k)\right)\sum_{m}\boldsymbol{\Delta}_m^{l+1}(k)\cdot\mathbf{w}_{jm}^{l} \tag{3.31}$$

where we have defined the vector

$$\boldsymbol{\Delta}_m^{l}(k) \equiv \left[\delta_m^{l}(k),\ \delta_m^{l}(k+1),\ \ldots,\ \delta_m^{l}(k+T)\right] \tag{3.32}$$
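As a concrete reading of Eqs. (3.31) and (3.32), the sketch below (array shapes, the tanh nonlinearity, and the function name are assumptions made for illustration) builds the vector $\boldsymbol{\Delta}_m^{l+1}(k)$ from already available next-layer deltas and correlates it with each FIR filter $\mathbf{w}_{jm}^{l}$:

```python
import numpy as np

def hidden_layer_deltas(delta_next, W, s, k):
    """Hidden-layer delta of Eqs. (3.31)-(3.32) at time index k.

    delta_next : (n_next, K) deltas delta_m^{l+1}(t) for all t
    W          : (n_hidden, n_next, T+1) coefficients w_jm^l(n)
    s          : (n_hidden,) pre-activations s_j^l(k)
    k          : current time index; requires k + T < K
    """
    taps = W.shape[2]
    Delta = delta_next[:, k:k + taps]   # Delta_m^{l+1}(k) = [delta_m(k), ..., delta_m(k+T)]
    # delta_j^l(k) = f'(s_j) * sum_m Delta_m . w_jm   (tanh assumed, so f' = 1 - tanh^2)
    return (1.0 - np.tanh(s) ** 2) * np.einsum('jmn,mn->j', W, Delta)
```

Note that the formula needs deltas at future times k + 1, …, k + T; in an online implementation these become available only after a delay, which corresponds to the reversed signal flow of Figure 3.7.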
Figure 3.7 Temporal backpropagation.
Each term within the sum corresponds to a reverse FIR filter, as illustrated in Figure 3.7. The filter is drawn in such a way as to emphasize the reversal of signal propagation through the FIR. The backward propagation of error terms mirrors the forward propagation of states and is obtained by simply reversing the direction of signal flow; in this process, unit delay operators $q^{-1}$ are replaced with unit advances $q^{+1}$. The complete adaptation algorithm can be summarized as follows:
$$\mathbf{w}_{ij}^{l}(k+1) = \mathbf{w}_{ij}^{l}(k) - \mu\,\delta_j^{l+1}(k)\,\mathbf{a}_i^{l}(k) \tag{3.33}$$

$$\delta_j^{l}(k) = \begin{cases} -2\,e_j(k)\,f'\!\left(s_j^{l}(k)\right), & l = L \ \text{(output layer)} \\[4pt] f'\!\left(s_j^{l}(k)\right)\displaystyle\sum_{m}\boldsymbol{\Delta}_m^{l+1}(k)\cdot\mathbf{w}_{jm}^{l}, & 1 \le l \le L-1 \ \text{(hidden layers)} \end{cases} \tag{3.34}$$
The bias weight may again be adapted by letting $a_i^{l}(k) = 1$ in Eq. (3.33). Observe the similarities between these equations and those for standard backpropagation. In fact, by replacing the vectors a, w, and δ by scalars, the previous equations reduce to precisely the backpropagation algorithm for static networks. Differences in the temporal version are due to implicit time relations. To find $\delta_j^{l}(k)$, we filter the δ's from the next layer backward through the FIR (see Figure 3.7). In other words, the δ's are created not only by taking weighted sums, but also by backward filtering. For each input x(k) and desired vector d(k), the forward filters are incremented one time step, producing the current output y(k) and corresponding error e(k). Next, the backward filters are incremented one time step, advancing the δ(k) terms and allowing the filter coefficients to be updated. The process is then repeated for a new input at time k + 1.
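Putting the pieces together, the following self-contained NumPy sketch trains a toy two-layer FIR network on a one-step prediction task. It is written in batch form over a recorded sequence for readability, whereas the procedure described above interleaves the forward and backward filtering one time step at a time; all dimensions, the tanh nonlinearity, and the task itself are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): 1 input, 3 hidden, 1 output node, T = 2 taps.
n_in, n_hid, n_out, T, K, mu = 1, 3, 1, 2, 200, 0.01
W0 = 0.1 * rng.standard_normal((n_hid, n_in, T + 1))    # w_ij^0(n): input -> hidden FIRs
W1 = 0.1 * rng.standard_normal((n_out, n_hid, T + 1))   # w_jm^1(n): hidden -> output FIRs

x = np.sin(0.3 * np.arange(K + 2 * T))[None, :]         # input sequence x(k)
d = np.roll(x, -1, axis=1)                               # toy target: one-step prediction

def taps(a, k):
    """Tap-delay-line state [a(k), a(k-1), ..., a(k-T)] for every node of a layer."""
    return a[:, k - T:k + 1][:, ::-1]

f, fprime = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2

for epoch in range(20):
    # Forward filtering over the whole sequence (Eq. 3.22).
    s1, a1 = np.zeros((n_hid, K)), np.zeros((n_hid, K))
    s2, y = np.zeros((n_out, K)), np.zeros((n_out, K))
    for k in range(T, K):
        s1[:, k] = np.einsum('jin,in->j', W0, taps(x, k));  a1[:, k] = f(s1[:, k])
        s2[:, k] = np.einsum('jin,in->j', W1, taps(a1, k)); y[:, k] = f(s2[:, k])
    e = d[:, :K] - y                                     # errors, Eq. (3.17)

    # Output-layer deltas (Eq. 3.25), then hidden-layer deltas by backward
    # filtering of future deltas (Eqs. 3.31 and 3.32).
    delta2 = -2.0 * e * fprime(s2)
    delta1 = np.zeros((n_hid, K))
    for k in range(T, K - T):
        delta1[:, k] = fprime(s1[:, k]) * np.einsum('mjn,mn->j', W1, delta2[:, k:k + T + 1])

    # Coefficient updates (Eq. 3.33); the online algorithm of the text applies
    # them one time step at a time as the delta terms become available.
    for k in range(T, K - T):
        W1 -= mu * np.einsum('m,jn->mjn', delta2[:, k], taps(a1, k))
        W0 -= mu * np.einsum('j,in->jin', delta1[:, k], taps(x, k))
    print(epoch, float(np.sum(e[:, T:K - T] ** 2)))
```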
The symmetry between the forward propagation of states and the backward propagation of error terms is preserved in temporal backpropagation. The number of operations per iteration now grows only linearly with the number of layers and synapses in the network. This saving is due to the efficient recursive formulation: each coefficient enters into the calculation only once, in contrast to the redundant use of terms when standard backpropagation is applied to the unfolded network.