6.2 Architecture
As illustrated in Figure 7, a general neural network takes in an input $x$ and produces an output $y$; the output of one sample does not influence the output of another sample. To capture the dependence between inputs, an RNN adds a loop that connects the previous information with the current state. The graph on the left side of Figure 8 shows the structure of an RNN, which has a loop connection to leverage previous information.
An RNN can work with sequence data, where the input is a sequence, the target is a sequence, or both. An input sequence can be denoted as $\{x_1, x_2, \ldots, x_T\}$, where each data point $x_t$ is a real-valued vector. Similarly, the target sequence can be denoted as $\{y_1, y_2, \ldots, y_T\}$. A sample from the sequence dataset is typically a pair of one input sequence and one target sequence. The right side of Figure 8 shows the information-passing process. At $t = 1$, the network $A$ takes in a randomly initialized vector $h_0$ together with $x_1$ and outputs the hidden state $h_1$; then at $t = 2$, $A$ takes in both $x_2$ and $h_1$ and outputs $h_2$. This process is repeated over all data points in the input sequence.
Figure 7 Feedforward network.
Figure 8 Architecture of recurrent neural network (RNN).
Though multiple network blocks are shown on the right side of Figure 8, they share the same structure and weights. A simple example of the process can be written as
(9) $h_t = f\left(W_h h_{t-1} + W_x x_t + b\right)$
where $W_h$ and $W_x$ are weight matrices of the network $A$, $f$ is an activation function, and $b$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $h_T$ can be passed into another network to make the prediction. For a sequence-to-sequence model, a prediction $\hat{y}_t$ can be generated based on $h_t$ and then compared with the target $y_t$.
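To make this concrete, the following is a minimal NumPy sketch of the forward pass in Equation (9), unrolled over an input sequence as in Figure 8. The choice of tanh for $f$, the dimensions, and names such as `rnn_forward` and `h0` are illustrative assumptions, not details taken from the text.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, h0):
    """Unroll h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b) over a sequence.

    The same weights (W_h, W_x, b) are reused at every time step,
    mirroring the repeated block A on the right side of Figure 8.
    """
    h = h0
    hs = []
    for x_t in xs:                              # x_1, ..., x_T
        h = np.tanh(W_h @ h + W_x @ x_t + b)    # Equation (9) with f = tanh
        hs.append(h)
    return hs                                   # hidden states h_1, ..., h_T

# Illustrative sizes (assumptions): input dimension 3, hidden dimension 4, T = 5.
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
W_h = rng.normal(scale=0.5, size=(4, 4))
W_x = rng.normal(scale=0.5, size=(4, 3))
b = np.zeros(4)
h0 = rng.normal(size=4)    # randomly initialized h_0, as described above

hs = rnn_forward(xs, W_h, W_x, b, h0)
# For classification, the final state hs[-1] (i.e., h_T) would feed a separate output network;
# for a sequence-to-sequence task, each h_t could be mapped to a prediction y_hat_t.
```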
However, a drawback of the RNN is that it has trouble “remembering” remote information. In an RNN, long-term memory is reflected in the weights of the network, which memorize remote information via the shared weights. Short-term memory is in the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length is large, the optimization of the RNN suffers from the vanishing gradient problem. For example, if the loss $L$ is evaluated at $t = T$, the gradient with respect to $h_1$ calculated via backpropagation can be written as
(10) $\dfrac{\partial L}{\partial h_1} = \dfrac{\partial L}{\partial h_T} \prod_{t=2}^{T} \dfrac{\partial h_t}{\partial h_{t-1}}$
where the product $\prod_{t=2}^{T} \dfrac{\partial h_t}{\partial h_{t-1}}$ is the reason for the vanishing gradient. In the RNN, the tanh function is commonly used as the activation function, so
(11) $\dfrac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - \tanh^2\!\left(W_h h_{t-1} + W_x x_t + b\right)\right) W_h$
Therefore, each factor in Equation (10) involves the tanh derivative $1 - \tanh^2(\cdot)$, and this derivative is always smaller than 1. When $T$ becomes larger, the gradient gets closer to zero, making it hard to train the network and to update the weights with remote information. However, relevant information may be far apart in the sequence, so how to leverage remote information from a long sequence is an important question.
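As a rough numerical illustration of Equations (10) and (11), the sketch below multiplies the per-step factors along a sequence and shows how the resulting gradient shrinks as $T$ grows. The scalar hidden state and the particular weight values are simplifying assumptions made purely for this illustration.

```python
import numpy as np

# Scalar RNN (hidden size 1), so dh_t/dh_{t-1} = (1 - tanh(a_t)**2) * w_h,
# with a_t = w_h * h_{t-1} + w_x * x_t + b.  All values here are illustrative.
rng = np.random.default_rng(1)
w_h, w_x, b = 0.9, 1.0, 0.0

for T in (5, 20, 50, 100):
    h, grad = 0.0, 1.0                   # grad accumulates the product over t = 2, ..., T
    for t in range(1, T + 1):
        a = w_h * h + w_x * rng.normal() + b
        h = np.tanh(a)
        if t >= 2:
            grad *= (1.0 - h**2) * w_h   # each factor has magnitude below 1 here
    print(f"T = {T:3d}: |dh_T/dh_1| = {abs(grad):.2e}")
```

Because every factor has magnitude below 1 in this setting, the product decays roughly geometrically in $T$, which is the vanishing gradient described above.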