
CHAPTER 3

Deep Learning Frameworks

In recent years, deep learning, also called the neural approach, has been proposed for text production. The pre-neural approach generally relied on a pipeline of modules, each performing a specific subtask. The neural approach differs markedly from the pre-neural one in that it provides a uniform (end-to-end) framework for text production. First, the input is projected onto a continuous representation (representation learning); then, the generation process (generation) produces an output text from this input representation. Figure 3.1 illustrates this high-level framework used by neural approaches to text production.

One of the main strengths of neural networks is that they provide a powerful tool for representation learning. Representation learning happens in a continuous space, such that different input modalities, e.g., text (words, sentences, and even paragraphs), graphs, and tables, are represented by dense vectors. For instance, given the user input “I am good. How about you? What do you do for a living?” in a dialogue setting, a neural network will first be used to create a representation of the user input. Then, in a second step (the generation step), this representation is used as the input to a decoder which generates the system response, “Ah, boring nine to five office job. Pays for the house I live in”, a text conditioned on that input representation. Representation learning aims at encoding the information from the input that is necessary to generate the output text. Neural networks have proven to be effective at representation learning without requiring explicit feature extraction from the data. These networks operate as complex functions that propagate values (linear transformations of the input values) through non-linear activation functions (such as the sigmoid or the hyperbolic tangent) to produce outputs that can be propagated in the same way to the upper layers of the network.

This chapter introduces current methods in deep learning that are common in natural language generation. The goal is to give a basic introduction to neural networks in Section 3.1 and to discuss the basic encoder-decoder approach [Cho et al., 2014, Sutskever et al., 2014], which has been the basis for much of the work on neural text production.

3.1 BASICS

Central to deep learning is its ability to do representation learning by introducing representations that are expressed in terms of other, simpler representations [Goodfellow et al., 2016].1 Typically, neural networks are organised in layers; each layer consists of a number of interconnected nodes; each node takes inputs from the previous layer and applies a linear transformation followed by a nonlinear activation function. The network takes an input through the input layer, which communicates with one or more hidden layers, and finally produces model predictions through the output layer. Figure 3.2 illustrates an example deep learning system. A key characteristic of deep learning systems is that they can learn complex concepts from simpler concepts through the use of nonlinear activation functions; the activation function of a node defines the output of that node given its inputs. Several hidden layers are often stacked to learn more and more complex and abstract concepts, leading to a deep network.


Figure 3.1: Deep learning for text generation.


Figure 3.2: Feed-forward neural network or multi-layer perceptron.

What we have just described is essentially a feed-forward neural network, or multi-layer perceptron. It learns a function mapping a set of input values from the input layer to a set of output values in the output layer. The function is formed by composing many linear functions through nonlinear activations.2 Such networks are called feed-forward because information always flows forward from the input layer to the output layer through the hidden layers in between; there are no autoregressive connections in which the outputs of a layer are fed back to itself. Neural networks with autoregressive connections are called recurrent neural networks (RNNs); they are widely used for text production and are discussed later in this section.
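To make this concrete, the NumPy sketch below implements a small two-layer feed-forward network: each layer applies a linear transformation followed by a nonlinear activation, and information flows only forward from input to output. The layer sizes, the tanh activation, and the softmax output are illustrative choices, not prescribed by the text.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network (multi-layer perceptron).

    Each layer applies a linear transformation followed by a nonlinear
    activation; information only flows forward, from input to output.
    """
    h = np.tanh(W1 @ x + b1)            # hidden layer: linear map + tanh
    o = W2 @ h + b2                     # output layer: linear map
    return np.exp(o) / np.exp(o).sum()  # softmax over the output units

# Illustrative dimensions: 4 input features, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)
print(feed_forward(rng.normal(size=4), W1, b1, W2, b2))
```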

3.1.1 CONVOLUTIONAL NEURAL NETWORKS

Another type of neural network, the convolutional neural network, or CNN [Lecun, 1989], is specialised for processing data that has a known grid-like topology. These networks have turned out to be successful in processing image data, which can be represented as 2-dimensional grids of image pixels [Krizhevsky et al., 2012, Xu et al., 2015a], or time-series data from automatic speech recognition problems [Abdel-Hamid et al., 2014, Zhang et al., 2017]. In recent years, CNNs have also been applied to natural language. In particular, they have been used to effectively learn word representations for language modelling [Kim et al., 2016] and sentence representations for sentence classification [Collobert et al., 2011, Kalchbrenner et al., 2014, Kim, 2014, Zhang et al., 2015] and summarisation [Cheng and Lapata, 2016, Denil et al., 2014, Narayan et al., 2017, 2018a,c]. CNNs employ a specialised kind of linear operation called convolution, followed by a pooling operation, to build a representation that is aware of spatial interactions among input data points. Figure 3.3 from Narayan et al. [2018a] shows how CNNs can be used to learn a sentence representation. First of all, CNNs require the input to be in a grid-like structure. For example, a sentence s of length k can be represented as a dense matrix W = [w1 ⊕ w2 ⊕ … ⊕ wk] ∈ Rk×d, where wi ∈ Rd is a continuous representation of the ith word in s and ⊕ is the concatenation operator. We apply a one-dimensional convolutional filter K ∈ Rh×d of width h to a window of h words in s to produce a new feature.3 This filter is applied to each possible window of words in s to produce a feature map f = [f1, f2, …, fk−h+1] ∈ Rk−h+1, where fi is defined as:


fi = ReLU(K ∘ Wi:i+h−1 + b),

where Wi:i+h−1 ∈ Rh×d is the window of h word vectors starting at the ith word, ∘ is the Hadamard product followed by a sum over all elements, ReLU is the rectified linear activation, and b ∈ R is a bias term. The ReLU activation is often used because it is easier to train and often achieves better performance than the sigmoid or tanh functions [Krizhevsky et al., 2012]. Max-pooling over time [Collobert et al., 2011] is applied over the feature map f to get fmax = max(f) as the feature corresponding to this particular filter K. Multiple filters Kh of width h are often used to compute a list of features fKh. In addition, filters of varying widths are applied to learn a set of feature lists, one per filter width. Finally, all feature lists are concatenated to get the final sentence representation.


Figure 3.3: Convolutional neural network for sentence encoding.

We describe in Chapter 5 how such convolutional sentence encoders can be used for better input understanding in text production. Importantly, through the use of convolutional filters, CNNs facilitate sparse interactions, parameter sharing, and equivariant representations. We refer the reader to Chapter 9 of Goodfellow et al. [2016] for more details on these properties.
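As an illustration of the convolutional sentence encoder described above, the NumPy sketch below slides a filter K of width h over the sentence matrix W, computes each feature as a rectified Hadamard-product sum plus bias, and max-pools each feature map over time before concatenating the results. The sentence length, embedding size, and filter widths are illustrative, and a single filter per width is used for brevity.

```python
import numpy as np

def conv_feature_map(W, K, b):
    """Slide one convolutional filter K (h x d) over the sentence matrix
    W (k x d), producing the feature map f of length k - h + 1."""
    k, _ = W.shape
    h = K.shape[0]
    f = np.empty(k - h + 1)
    for i in range(k - h + 1):
        window = W[i:i + h]                        # h consecutive word vectors
        f[i] = max(0.0, (K * window).sum() + b)    # ReLU(Hadamard product, summed, + bias)
    return f

def sentence_encoding(W, filters):
    """Max-pool each filter's feature map over time and concatenate."""
    return np.array([conv_feature_map(W, K, b).max() for K, b in filters])

# Illustrative setup: a 6-word sentence with 4-dimensional word embeddings,
# and one filter each of widths 2 and 3 (in practice, many filters per width).
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                        # sentence matrix, k x d
filters = [(rng.normal(size=(2, 4)), 0.0),
           (rng.normal(size=(3, 4)), 0.0)]
print(sentence_encoding(W, filters))               # one pooled feature per filter
```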

3.1.2 RECURRENT NEURAL NETWORKS

Feed-forward networks and CNNs fail to adequately represent the sequential nature of natural language. RNNs, in contrast, provide a natural model for sequences.

An RNN updates its state for every element of an input sequence. Figure 3.4 presents an RNN on the left and its application to the natural language text "How are you doing?" on the right. At each time step t, the RNN takes as input the previous state st−1 and the current input element xt, and updates its current state as:

st = σ(U xt + V st−1),

where U and V are model parameters and σ is a nonlinear activation function, typically tanh. At the end of the input sequence, the final state provides a representation encoding information from the whole sequence.
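A minimal NumPy sketch of this recurrence is given below: the state update is unrolled over the input sequence and the final state is returned as the sequence representation. The tanh nonlinearity and all dimensions are illustrative choices.

```python
import numpy as np

def rnn_encode(xs, U, V, s0):
    """Unroll a vanilla RNN over the input sequence xs.

    At each step, the state is updated from the current input x_t and the
    previous state s_{t-1}; the final state summarises the whole sequence.
    """
    s = s0
    for x in xs:
        s = np.tanh(U @ x + V @ s)   # s_t = tanh(U x_t + V s_{t-1})
    return s

# Illustrative dimensions: 4 time steps, 5-dimensional inputs, 8-dimensional state.
rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 5))
U, V = rng.normal(size=(8, 5)), rng.normal(size=(8, 8))
print(rnn_encode(xs, U, V, np.zeros(8)))
```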


Figure 3.4: RNNs applied to a sentence.


Figure 3.5: Long-range dependencies. The shown dependency tree is generated using the Stanford CoreNLP toolkit [Manning et al., 2014].

Most work on neural text production has used RNNs due to their ability to naturally capture the sequential nature of the text and to process inputs and outputs of arbitrary length.

3.1.3 LSTMS AND GRUS

RNNs naturally permit taking arbitrarily long context into account, and so can implicitly capture long-range dependencies, a phenomenon frequently observed in natural languages. Figure 3.5 shows an example of a long-range dependency in the sentence "The yogi, who gives yoga lessons every morning at the beach, is meditating." A good representation learning method should capture that "the yogi" is the subject of the verb "meditating".

In practice, however, as the length of the input sequence grows, RNNs are prone to losing information from the beginning of the sequence due to vanishing and exploding gradient issues [Bengio et al., 1994, Pascanu et al., 2013]. This is because, in the case of RNNs, backpropagation applies through a large number of layers (the multiple layers corresponding to each time step). Since backpropagation updates the weights in proportion to the partial derivatives (the gradients) of the loss, and because of the sequential multiplication of matrices as the RNN is unrolled, the gradient may become either very large or, more commonly, very small, effectively causing the weights to either explode or never change at the lower/earlier layers. Consequently, RNNs fail to adequately model the long-range dependencies of natural languages.
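This effect is easy to reproduce numerically. The sketch below repeatedly multiplies a gradient vector by the same matrix, standing in for the per-step Jacobian of an unrolled RNN: when each step scales the gradient by a factor below 1 the norm collapses, and when it scales it by a factor above 1 the norm blows up. The matrices and step counts are purely illustrative.

```python
import numpy as np

def gradient_norms(jacobian, steps=100):
    """Norm of a gradient vector after being backpropagated through
    `steps` identical layers, i.e. repeatedly multiplied by the Jacobian."""
    g = np.ones(jacobian.shape[0])
    norms = []
    for _ in range(steps):
        g = jacobian.T @ g
        norms.append(np.linalg.norm(g))
    return norms

# Use a scaled orthogonal matrix as a stand-in Jacobian, so that every
# backpropagation step scales the gradient norm by exactly the same factor.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
print(gradient_norms(0.9 * Q)[-1])   # factor < 1: the gradient vanishes
print(gradient_norms(1.1 * Q)[-1])   # factor > 1: the gradient explodes
```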


Figure 3.6: Sketches of LSTM and GRU cells. On the left, i, f, and o are the input, forget, and output gates, respectively; c and c̃ represent the memory cell contents. On the right, r and z are the reset and update gates, and h and h̃ are the cell activations.

Long short-term memory (LSTM, [Hochreiter and Schmidhuber, 1997]) and gated recurrent unit (GRU, [Cho et al., 2014]) cells have been proposed as alternative recurrent units which are better suited to learning long-range dependencies. These units learn to memorise only the part of the past that is relevant for the future: at each time step, they dynamically update their states, deciding what to memorise and what to forget from the previous input.

The LSTM cell (shown in Figure 3.6, left) achieves this using input (i), forget (f), and output (o) gates with the following operations:

ft = σ(Wf [ht−1, xt] + bf),    (3.2)
it = σ(Wi [ht−1, xt] + bi),    (3.3)
ot = σ(Wo [ht−1, xt] + bo),    (3.4)
c̃t = tanh(Wc [ht−1, xt] + bc),    (3.5)
ct = ft ⊙ ct−1 + it ⊙ c̃t,    (3.6)
ht = ot ⊙ tanh(ct),    (3.7)
where [ht−1, xt] denotes the concatenation of the previous cell activation ht−1 and the current input xt, σ is the sigmoid function, ⊙ is element-wise multiplication, and W* and b* are LSTM cell parameters. The input gate (Eq. (3.3)) regulates how much of the new candidate state c̃t is written to the memory cell, the forget gate (Eq. (3.2)) regulates how much of the existing memory to forget, and the output gate (Eq. (3.4)) regulates how much of the cell state should be passed forward to the next time step. The GRU cell (shown in Figure 3.6, right), on the other hand, achieves this using update (z) and reset (r) gates with the following operations:

zt = σ(Wz [ht−1, xt]),    (3.8)
rt = σ(Wr [ht−1, xt]),    (3.9)
h̃t = tanh(Wh [rt ⊙ ht−1, xt]),    (3.10)
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t,    (3.11)
where W* are GRU cell parameters. The update gate (Eq. (3.8)) regulates how much of the candidate activation to use in updating the cell state, and the reset gate (Eq. (3.9)) regulates how much of the previous state to forget when computing the candidate activation. The LSTM cell has separate input and forget gates, while the GRU cell performs both of these operations together using its update gate.

In a vanilla RNN, the entire cell state is overwritten by the current activation, whereas both LSTMs and GRUs have a mechanism to carry over memory from previous activations. This allows recurrent networks with LSTM or GRU cells to remember features for a long time and reduces the vanishing-gradient problem that arises when the gradient is backpropagated through many bounded non-linearities.
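The gating described above amounts to only a few lines of code. The NumPy sketch below implements a single GRU step following Eqs. (3.8)-(3.11), concatenating the previous state and the current input as the single parameter matrix per gate suggests; all dimensions and the random parameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: the update gate z and reset gate r decide what to keep
    from the previous state and what to overwrite with the candidate."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ concat)                                   # update gate, Eq. (3.8)
    r = sigmoid(Wr @ concat)                                   # reset gate, Eq. (3.9)
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))   # candidate, Eq. (3.10)
    return (1.0 - z) * h_prev + z * h_cand                     # new state, Eq. (3.11)

# Illustrative dimensions: 5-dimensional inputs, 8-dimensional state.
rng = np.random.default_rng(0)
d_x, d_h = 5, 8
Wz, Wr, Wh = (rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
print(gru_step(rng.normal(size=d_x), np.zeros(d_h), Wz, Wr, Wh))
```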

LSTMs and GRUs have been very successful in modelling natural languages in recent years. They have practically replaced the vanilla RNN cell in recurrent networks.

3.1.4 WORD EMBEDDINGS

One of the key strengths of neural networks is that representation learning happens in a continuous space. For example, an RNN learns a continuous dense representation of an input text by encoding the sequence of words making up that text. At each time step, it takes a word represented as a continuous vector (often called a word embedding). In sharp contrast to pre-neural approaches, where words were often treated as symbolic features, word embeddings provide a more robust and enriched representation of words, capturing their meaning, semantic relationships, and distributional similarities (the similarity of the contexts they appear in).

Figure 3.7 shows a two-dimensional representation of word embeddings. As can be seen, words that often occur in similar contexts (e.g., “battery” and “charger”) are mapped closer to each other than words that do not (e.g., “battery” and “sink”). Word embeddings thus give a notion of similarity among words that look very different from each other in their surface forms. Due to this continuous representation, neural text-production approaches lead to more robust models and better generalisation than pre-neural approaches, whose symbolic representations make them brittle. Mikolov et al. [2013] further show that these word embeddings exhibit compositional properties in the distributional space, e.g., one can start from the word “queen” and get to the word “woman” by following the direction from the word “king” to the word “man”.

Given a vocabulary V, we represent each word w ∈ V by a continuous vector ew ∈ Rd of length d. We define a word embedding matrix W ∈ R|V|×d, representing each word in the vocabulary V. Earlier neural networks often used pre-trained word embeddings such as Word2Vec [Mikolov et al., 2013] or GloVe [Pennington et al., 2014]. In these approaches, the word embedding matrix W is learned in an unsupervised fashion from a large amount of raw text. Word2Vec adopts a predictive feed-forward model, aiming to maximise the prediction probability of a target word given its surrounding context. GloVe achieves this by directly reducing the dimensionality of the word co-occurrence count matrix. Importantly, the embeddings learned by both approaches capture the distributional similarity among words. In a parallel trend to using pre-trained word embeddings, several other text-production models have shown that word embeddings can first be initialised randomly and then trained jointly with other network parameters; these jointly trained word embeddings are fine-tuned and better suited to the task at hand.
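As a small illustration of how the embedding matrix is used, the NumPy sketch below looks up rows of W ∈ R|V|×d for a toy vocabulary and compares words by cosine similarity. The vocabulary and the randomly initialised vectors are purely illustrative; in practice W would be pre-trained (e.g., with Word2Vec or GloVe) or learned jointly with the rest of the network.

```python
import numpy as np

# Toy vocabulary; in practice |V| is tens of thousands of words or more.
vocab = ["battery", "charger", "sink", "king", "queen", "man", "woman"]
word2id = {w: i for i, w in enumerate(vocab)}

d = 50                                   # embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), d))     # embedding matrix W, |V| x d
                                         # (randomly initialised here; pre-trained
                                         #  or jointly learned in practice)

def embed(word):
    """Look up the continuous vector e_w for a word."""
    return W[word2id[word]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With trained embeddings, distributionally similar words such as "battery"
# and "charger" end up with a higher cosine similarity than unrelated words
# such as "battery" and "sink".
print(cosine(embed("battery"), embed("charger")))
print(cosine(embed("battery"), embed("sink")))
```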
