3.4.3 Advanced RNN Architectures
The most popular RNN architectures for sequence learning evolved from the long short‐term memory (LSTM) [8] and bidirectional recurrent neural network (BRNN) [9] schemes. The former introduces the memory cell, a unit of computation that replaces traditional nodes in the hidden layer of a network. With these memory cells, networks are able to overcome difficulties with training encountered by earlier recurrent networks. The latter introduces an architecture in which information from both the future and the past is used to determine the output at any point in the sequence. This is in contrast to previous networks, in which only past input can affect the output, and it has been used successfully for sequence labeling tasks in natural language processing, among others. The two schemes are not mutually exclusive and have been successfully combined for phoneme classification [10] and handwriting recognition [11]. In this section, we explain the LSTM and the BRNN, and we describe the neural Turing machine (NTM), which extends RNNs with an addressable external memory [12].
LSTM scheme: This was introduced primarily to overcome the problem of vanishing gradients. The model resembles a standard RNN with a hidden layer, but each ordinary node in the hidden layer is replaced by a memory cell (Figure 3.19). Each memory cell contains a node with a self‐connected recurrent edge of fixed weight one, ensuring that the gradient can pass across many time steps without vanishing or exploding. To distinguish references to a memory cell from references to an ordinary node, we use the subscript c.
Simple RNNs have long‐term memory in the form of weights. The weights change slowly during training, encoding general knowledge about the data. They also have short‐term memory in the form of ephemeral activations, which pass from each node to successive nodes. The LSTM model introduces an intermediate type of storage via the memory cell. A memory cell is a composite unit, built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes, represented in diagrams by the letter X. All elements of the LSTM cell are enumerated and described below.
Figure 3.19 A long short‐term memory (LSTM) memory cell.
Note that when we use vector notation, we are referring to the values of the nodes in an entire layer of cells. For example, s is a vector containing the value of sc at each memory cell c in a layer. When the subscript c is used, it is to index an individual memory cell.
Input node: This unit, labeled gc , is a node that takes activation in the standard way from the input layer x(t) at the current time step and (along recurrent edges) from the hidden layer at the previous time step h(t − 1). Typically, the summed weighted input is run through a tanh activation function, although in the original LSTM paper, the activation function is a sigmoid.
Input gate: Gates are a distinctive feature of the LSTM approach. A gate is a sigmoidal unit that, like the input node, takes activation from the current data point x(t) as well as from the hidden layer at the previous time step. A gate is so called because its value is used to multiply the value of another node. It is a gate in the sense that if its value is 0, then flow from the other node is cut off. If the value of the gate is 1, all flow is passed through. The value of the input gate ic multiplies the value of the input node.
Internal state: At the heart of each memory cell is a node sc with linear activation, which is referred to in the original work as the “internal state” of the cell. The internal state sc has a self‐connected recurrent edge with fixed unit weight. Because this edge spans adjacent time steps with constant weight, error can flow across time steps without vanishing or exploding. This edge is often called the constant error carousel. In vector notation, the update for the internal state is s(t) = g(t) ⊙ i(t) + s(t − 1), where ⊙ denotes pointwise multiplication.
Forget gate: These gates fc were introduced to provide a method by which the network can learn to flush the contents of the internal state. This is especially useful in continuously running networks. With forget gates, the equation to calculate the internal state on the forward pass is s(t) = g(t) ⊙ i(t) + f(t) ⊙ s(t − 1).
Output gate: The value vc ultimately produced by a memory cell is the value of the internal state sc multiplied by the value of the output gate oc. It is customary for the internal state first to be run through a tanh activation function, as this gives the output of each cell the same dynamic range as an ordinary tanh hidden unit. However, in other neural network research, rectified linear units, which have a greater dynamic range, have proven easier to train, so it seems plausible that the nonlinear function on the internal state could be omitted.
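To make the role of the gates concrete, the following toy sketch (our illustration, not an example from [8]) drives a single scalar memory cell through a write, a hold, a read, and a flush, using hand‐picked gate activations; the update rule is exactly the gated state equation given above.

```python
import numpy as np

# Toy, hand-picked gate schedule for one scalar memory cell (illustrative only).
phi = np.tanh      # nonlinearity applied to the input node and to the state on output
s = 0.0            # internal state of the cell

steps = [
    # (input node g, input gate i, forget gate f, output gate o)
    (0.9, 1.0, 1.0, 0.0),  # write: g enters the state, nothing is emitted
    (0.5, 0.0, 1.0, 0.0),  # hold: input gate closed, forget gate open, state preserved
    (0.0, 0.0, 1.0, 1.0),  # read: output gate open, the stored value is emitted
    (0.0, 0.0, 0.0, 0.0),  # flush: forget gate closed, the state is reset to zero
]

for g, i, f, o in steps:
    s = g * i + f * s      # s(t) = g(t) i(t) + f(t) s(t - 1), scalar case
    v = phi(s) * o         # cell output: tanh of the state, gated by o
    print(f"s = {s:+.3f}, output = {v:+.3f}")
```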
In the original paper and in most subsequent work, the input node is labeled g. We adhere to this convention but note that it may be confusing, as g does not stand for gate. In the original paper, the gates are called y_in and y_out, but this is confusing because y generally stands for output in the machine learning literature. In the interests of clarity, we break with this convention and use i, f, and o to refer to the input, forget, and output gates, respectively.
Computation in the LSTM model is presented by the following equations, performed at each time step. These equations give the full algorithm for a modern LSTM with forget gates:
g(t) = ϕ(W^{gx} x(t) + W^{gh} h(t − 1) + b_g)
i(t) = σ(W^{ix} x(t) + W^{ih} h(t − 1) + b_i)
f(t) = σ(W^{fx} x(t) + W^{fh} h(t − 1) + b_f)
o(t) = σ(W^{ox} x(t) + W^{oh} h(t − 1) + b_o)
s(t) = g(t) ⊙ i(t) + f(t) ⊙ s(t − 1)
h(t) = ϕ(s(t)) ⊙ o(t)
(3.72)
The value of the hidden layer of the LSTM at time t is the vector h(t), while h(t − 1) contains the values output by each memory cell in the hidden layer at the previous time step. Note that these equations include the forget gate; the calculations for the simpler LSTM without forget gates are obtained by setting f(t) = 1 for all t. We use the tanh function ϕ for the input node g; however, in the original LSTM paper, the activation function for g is the sigmoid σ.
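As a concrete, simplified reference, the following NumPy sketch implements one forward step of an LSTM layer as written in Eq. (3.72). The function name lstm_step and the parameter dictionary p with keys such as W_gx and b_g are our own illustrative choices, not notation from the book; driving the forget‐gate activations to one recovers the simpler LSTM without forget gates mentioned above.

```python
import numpy as np

def lstm_step(x_t, h_prev, s_prev, p):
    """One forward step of an LSTM layer with forget gates, following Eq. (3.72).
    p holds, for each of g, i, f, o, the input weights W_*x, the recurrent
    weights W_*h, and the bias b_* (the dictionary layout is ours)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic sigmoid for the gates
    phi = np.tanh                                # tanh for the input node and the state

    g = phi(  p["W_gx"] @ x_t + p["W_gh"] @ h_prev + p["b_g"])   # input node
    i = sigma(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["b_i"])   # input gate
    f = sigma(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["b_f"])   # forget gate
    o = sigma(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["b_o"])   # output gate

    s = g * i + f * s_prev      # internal state (constant error carousel)
    h = phi(s) * o              # values output by the memory cells
    return h, s

# Usage with random parameters: 4-dimensional input, 3 memory cells in the layer.
rng = np.random.default_rng(0)
n_in, n_c = 4, 3
p = {}
for k in "gifo":
    p[f"W_{k}x"] = 0.1 * rng.normal(size=(n_c, n_in))
    p[f"W_{k}h"] = 0.1 * rng.normal(size=(n_c, n_c))
    p[f"b_{k}"] = np.zeros(n_c)
h, s = np.zeros(n_c), np.zeros(n_c)
for x_t in rng.normal(size=(5, n_in)):           # a length-5 input sequence
    h, s = lstm_step(x_t, h, s, p)
```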
BRNNs: Besides the LSTM, one of the most widely used RNN architectures is the BRNN (Figure 3.20).
In this architecture, there are two layers of hidden nodes. Both hidden layers are connected to input and output. Only the first layer has recurrent connections from the past time steps, while in the second layer, the direction of recurrent connections is flipped, passing activation backward along the sequence. Given an input sequence and a target sequence, the BRNN can be trained by ordinary backpropagation after unfolding across time. The following three equations describe a BRNN:
h(t) = σ(W^{hx} x(t) + W^{hh} h(t − 1) + b_h)
z(t) = σ(W^{zx} x(t) + W^{zz} z(t + 1) + b_z)
ŷ(t) = softmax(W^{yh} h(t) + W^{yz} z(t) + b_y)
(3.73)
where h(t) and z(t) are the values of the hidden layers in the forward and backward directions, respectively.
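A sketch of the full forward pass of Eq. (3.73) over a sequence is given below; it is an illustration under our own naming of the parameter dictionaries (Wf, Wb, Wy), not code from [9]. The forward hidden layer h is computed by a recurrence over increasing t, the backward layer z by a recurrence over decreasing t, and the output at each step combines both.

```python
import numpy as np

def brnn_forward(X, Wf, Wb, Wy):
    """Forward pass of a BRNN over a whole sequence X of shape (T, n_in),
    following Eq. (3.73); Wf, Wb, Wy hold the forward-layer, backward-layer,
    and output-layer parameters (naming is ours)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    T, n_h = X.shape[0], Wf["hh"].shape[0]

    h = np.zeros((T, n_h))                 # forward hidden layer: recurrence from the past
    h_t = np.zeros(n_h)
    for t in range(T):
        h_t = sigma(Wf["hx"] @ X[t] + Wf["hh"] @ h_t + Wf["b"])
        h[t] = h_t

    z = np.zeros((T, n_h))                 # backward hidden layer: recurrence from the future
    z_t = np.zeros(n_h)
    for t in reversed(range(T)):
        z_t = sigma(Wb["zx"] @ X[t] + Wb["zz"] @ z_t + Wb["b"])
        z[t] = z_t

    logits = h @ Wy["yh"].T + z @ Wy["yz"].T + Wy["b"]   # both directions feed the output
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)              # softmax over output classes

# Usage with random parameters: 4-dim inputs, 3 hidden units per direction, 2 output classes.
rng = np.random.default_rng(1)
n_in, n_h, n_out, T = 4, 3, 2, 6
Wf = {"hx": 0.1 * rng.normal(size=(n_h, n_in)), "hh": 0.1 * rng.normal(size=(n_h, n_h)), "b": np.zeros(n_h)}
Wb = {"zx": 0.1 * rng.normal(size=(n_h, n_in)), "zz": 0.1 * rng.normal(size=(n_h, n_h)), "b": np.zeros(n_h)}
Wy = {"yh": 0.1 * rng.normal(size=(n_out, n_h)), "yz": 0.1 * rng.normal(size=(n_out, n_h)), "b": np.zeros(n_out)}
probs = brnn_forward(rng.normal(size=(T, n_in)), Wf, Wb, Wy)   # shape (T, n_out)
```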
NTMs: The NTM extends RNNs with an addressable external memory [12]. This enables RNNs to perform complex algorithmic tasks such as sorting. This is inspired by the theories in cognitive science that suggest humans possess a “central executive” that interacts with a memory buffer [13]. By analogy with a Turing machine, in which a program directs read heads and write heads to interact with external memory in the form of a tape, the model is called an NTM.
The two primary components of an NTM are a controller and a memory matrix. The controller, which may be a recurrent or feedforward neural network, takes input and returns output to the outside world, as well as passing instructions to and reading from the memory. The memory is represented by a large matrix of N memory locations, each of which is a vector of dimension M. Additionally, a number of read and write heads facilitate the interaction between the controller and the memory matrix. Despite these additional capabilities, the NTM is differentiable end‐to‐end and can be trained by variants of stochastic gradient descent (SGD) using backpropagation through time (BPTT).
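The content‐based part of the NTM's addressing can be sketched in a few lines. The fragment below is a reduced illustration (our simplification of the mechanism in [12], omitting the interpolation, shifting, and sharpening steps): a read head emits a key vector, the key is compared with every memory row by cosine similarity, a softmax sharpened by a parameter β turns the similarities into attention weights, and the read vector is the weighted sum of the memory rows.

```python
import numpy as np

def content_read(memory, key, beta):
    """Simplified content-based read over an N x M memory matrix (illustrative
    sketch only; the full NTM addressing in [12] adds location-based steps)."""
    eps = 1e-8
    # Cosine similarity between the key and each of the N memory rows.
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    # Sharpened softmax over locations gives the read weighting.
    w = np.exp(beta * sim)
    w /= w.sum()
    # The read vector is a convex combination of the memory rows.
    return w @ memory, w

N, M = 8, 5                                      # 8 memory locations of dimension 5
rng = np.random.default_rng(2)
memory = rng.normal(size=(N, M))
key = memory[3] + 0.05 * rng.normal(size=M)      # query with a noisy copy of row 3
r, w = content_read(memory, key, beta=10.0)
print(w.argmax())   # with a large beta, the weighting concentrates on the best-matching row
```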
Figure 3.20 A bidirectional recurrent neural network (BRNN).
In [12], five algorithmic tasks are used to test the performance of the NTM model. By algorithmic we mean that for each task, the target output for a given input can be calculated by following a simple program, as might be easily implemented in any universal programming language. One example is the copy task, where the input is a sequence of fixed‐length binary vectors followed by a delimiter symbol, and the target output is a copy of the input sequence. In another task, priority sort, the input consists of a sequence of binary vectors together with a distinct scalar priority value for each vector, and the target output is the sequence of vectors sorted by priority. The experiments test whether an NTM can be trained via supervised learning to implement these common algorithms correctly and efficiently. Interestingly, solutions found in this way generalize reasonably well to inputs longer than those presented in the training set. By contrast, the LSTM without external memory does not generalize well to longer inputs. The authors compare three architectures, namely an LSTM RNN, an NTM with a feedforward controller, and an NTM with an LSTM controller. On each task, both NTM architectures significantly outperform the LSTM RNN, both in training set performance and in generalization to test data.
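To make the copy task concrete, the sketch below generates one training example in the form described above: a sequence of fixed‐length binary vectors followed by a delimiter symbol, with the target being a copy of the input sequence. The particular encoding (a separate delimiter channel, zero padding while the answer is produced) is our assumption for illustration, not a specification taken from [12].

```python
import numpy as np

def make_copy_example(seq_len, vec_dim, rng):
    """One copy-task example: seq_len random binary vectors, then a delimiter
    flag on an extra input channel; the target is the same sequence, expected
    after the delimiter. The encoding conventions here are illustrative."""
    seq = rng.integers(0, 2, size=(seq_len, vec_dim)).astype(float)

    inputs = np.zeros((2 * seq_len + 1, vec_dim + 1))   # last column = delimiter channel
    inputs[:seq_len, :vec_dim] = seq                    # present the sequence
    inputs[seq_len, vec_dim] = 1.0                      # delimiter symbol

    targets = np.zeros((2 * seq_len + 1, vec_dim))
    targets[seq_len + 1:, :] = seq                      # reproduce it after the delimiter
    return inputs, targets

rng = np.random.default_rng(3)
x, y = make_copy_example(seq_len=5, vec_dim=8, rng=rng)  # shapes (11, 9) and (11, 8)
```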