Deep Learning Approaches to Text Production - Shashi Narayan
CHAPTER 1
Introduction
In this chapter, we outline the differences between text production and text analysis, introduce the main text-production tasks this book is concerned with (i.e., text production from data, from text, and from meaning representations), and summarise the content of each chapter. We also indicate what is not covered and introduce some notational conventions.
1.1 WHAT IS TEXT PRODUCTION?
While natural language understanding [NLU, Bates, 1995] aims to analyse text, text production, or natural language generation [NLG, Gatt and Krahmer, 2018, Reiter and Dale, 2000], focuses on generating texts. More specifically, NLG differs from NLU in two main ways (cf. Figure 1.1). First, unlike text analysis, which always takes text as input, text production has many possible input types, namely, text [e.g., Nenkova and McKeown, 2011], data [e.g., Wiseman et al., 2017], or meaning representations [e.g., Konstas et al., 2017]. Second, text production has various potential goals. For instance, the goal may be to summarise, verbalise, or simplify the input.
Correspondingly, text production has many applications depending on what the input is (data, text, or meaning representations) and what the goal is (simplifying, verbalising, paraphrasing, etc.). When the input is text (text-to-text or T2T generation), text production can be used to summarise the input document [e.g., Nenkova and McKeown, 2011], simplify a sentence [e.g., Shardlow, 2014, Siddharthan, 2014], or respond to a dialogue turn [e.g., Mott et al., 2004]. When the input is data, NLG can further be used to verbalise the content of a knowledge base [e.g., Power, 2009] or a database [e.g., Angeli et al., 2010], generate reports from numerical [e.g., Reiter et al., 2005] or KB data [e.g., Bontcheva and Wilks, 2004], or generate captions from images [e.g., Bernardi et al., 2016]. Finally, NLG has also been used to regenerate text from the meaning representations designed by linguists to represent the meaning of natural language [e.g., Song et al., 2017].
In what follows, we examine generation from meaning representations, data, and text in more detail.
1.1.1 GENERATING TEXT FROM MEANING REPRESENTATIONS
There are two main motivations for generating text from meaning representations.
First, an algorithm that converts meaning representations into well-formed text is a necessary component of traditional pipeline NLG systems [Gatt and Krahmer, 2018, Reiter and Dale, 2000]. As we shall see in Chapter 2, such systems include several modules, one of them (known as the surface realisation module) being responsible for generating text from some abstract linguistic representation derived by the system. To improve reusability, surface realisation challenges have recently been organised in an effort to identify input meaning representations that could serve as a common standard for NLG systems, thereby fueling research on that topic.
Figure 1.1: Input contents and communicative goals for text production.
Second, meaning representations can be viewed as an interface between NLU and NLG. Consider translation, for instance. Instead of learning machine translation models, which directly translate surface strings into surface strings, an interesting scientific challenge would be to learn a model that does something more akin to what humans seem to do, i.e., first, understand the source text, and second, generate the target text from the conceptual representation issued from that understanding (indeed a recent paper by Konstas et al. [2017] mentions this as future work). A similar two-step process (first, deriving a meaning representation from the input text, and second, generating a text from that meaning representation) also seems natural for such tasks as text simplification or summarisation.
Although there are still relatively few approaches adopting a two-step interpret-and-generate process or reusing existing surface realisation algorithms, there is already a large trend of research in text production which focuses on generating text from meaning representations produced by a semantic parser [May and Priyadarshi, 2017, Mille et al., 2018] or a dialogue manager [Novikova et al., 2017b]. In the case of semantic parsing, the meaning representations capture the semantics of the input text and can be exploited as mentioned above to model a two-step process in applications such as simplification [Narayan and Gardent, 2014], summarisation [Liu et al., 2015] or translation [Song et al., 2019b]. In the case of dialogue, the input meaning representation (a dialogue move) is output by the dialogue manager in response to the user input and provides the input to the dialogue generator, the module in charge of generating the system response.
Figure 1.2: Input shallow dependency tree from the generation challenge surface realisation task for the sentence “The most troublesome report may be the August merchandise trade deficit due out tomorrow.”
While a wide range of meaning representations and syntactic structures have been proposed for natural language (e.g., first-order logic, description logic, hybrid logic, derivation rather than derived syntactic trees), three main types of meaning representations have recently gained traction as input to text generation: meaning representations derived from syntactic dependency trees (cf. Figure 1.2), meaning representations derived through semantic parsing (cf. Figure 1.3), and meaning representations used as input to the generation of a dialogue engine response (cf. Figure 1.4). All three inputs gave rise to shared tasks and international challenges.
The Surface Realisation shared task [Belz et al., 2012, Mille et al., 2018] focuses on generating sentences from linguistic representations derived from syntactic dependency trees and includes a deep and a shallow track. For the shallow track, the input is an unordered, lemmatised syntactic dependency tree and the main focus is on linearisation (deriving the correct word order from the input tree) and morphological inflection (deriving the inflection from a lemma and a set of morphosyntactic features). For the deep track, on the other hand, the input is a dependency tree where dependencies are semantic rather than syntactic and function words have been removed. While the 2011 shared task only provided data for English, the 2018 version is multilingual and includes training data for Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian, and Spanish (shallow track), and English, French, and Spanish (deep track).
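To make the shallow-track setting concrete, the toy sketch below takes an unordered, lemmatised dependency tree and performs the two steps the track focuses on: linearisation and morphological inflection. The tree, the ordering heuristic, and the one-entry inflection lexicon are all illustrative stand-ins, not shared-task data or systems (real systems learn both steps from the training corpus).

```python
# Each node is (lemma, morphosyntactic features, children); the children
# are unordered in the input, as in the shallow track.
tree = ("sleep", {"tense": "pres", "person": 3, "num": "sg"}, [
    ("cat", {"num": "sg", "role": "subj"}, [("the", {}, [])]),
])

def linearise(node):
    # Naive hand-written heuristic: subjects and determiners precede the
    # head, everything else follows. A real system learns this ordering.
    lemma, feats, children = node
    before = [c for c in children if c[1].get("role") == "subj" or c[0] == "the"]
    after = [c for c in children if c not in before]
    words = []
    for c in before:
        words += linearise(c)
    words.append((lemma, feats))
    for c in after:
        words += linearise(c)
    return words

def inflect(lemma, feats):
    # Tiny illustrative lexicon; real systems learn inflection from
    # (lemma, features) -> form pairs in the training data.
    if lemma == "sleep" and feats.get("person") == 3 and feats.get("num") == "sg":
        return "sleeps"
    return lemma

print(" ".join(inflect(l, f) for l, f in linearise(tree)))
# With these toy rules: "the cat sleeps"
```

The deep track starts from a more abstract input: function words such as "the" are absent from the tree and must be generated as well.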
Figure 1.3: Example input from the SemEval AMR-to-Text Generation Task.
Figure 1.4: E2E dialogue move and text.
In the SemEval-2017 Task 9 Generation Subtask [May and Priyadarshi, 2017], the goal is to generate text from an Abstract Meaning Representation (AMR, cf. Figure 1.3), a semantic representation which includes entity identification and typing, PropBank [Palmer et al., 2005] semantic roles, entity grounding via wikification, as well as treatments of modality and negation. The task and the training data are restricted to English.
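As an illustration of what such an input looks like, AMRs are conventionally written in PENMAN notation. The sketch below encodes the AMR for the textbook example "The boy wants to go." as nested tuples and prints its PENMAN form; the renderer is a minimal illustrative sketch, not part of the SemEval tooling.

```python
# An AMR as (variable, concept, list of (role, child)). A child may also
# be a bare variable name, expressing re-entrancy: here the boy is both
# the wanter (:ARG0 of want-01) and the goer (:ARG0 of go-01).
amr = ("w", "want-01", [
    (":ARG0", ("b", "boy", [])),
    (":ARG1", ("g", "go-01", [(":ARG0", "b")])),
])

def penman(node):
    if isinstance(node, str):      # re-entrant reference to a variable
        return node
    var, concept, edges = node
    parts = [f"({var} / {concept}"]
    for role, child in edges:
        parts.append(f" {role} {penman(child)}")
    return "".join(parts) + ")"

print(penman(amr))
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
```

The re-entrancy is what makes AMRs graphs rather than trees, a point that matters for the encoders discussed in Chapter 5.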
Finally, following on previous work targeting the generation of a system response from a meaning representation consisting of a speech act (e.g., instruct, query, recommend) and a set of attribute-value pairs [Mairesse and Young, 2014, Wen et al., 2015], the E2E challenge [Novikova et al., 2017b] targets the generation of restaurant descriptions from sets of attribute-value pairs.1
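A minimal template-based baseline for this setting can be sketched as follows: the input is a set of attribute-value pairs describing a restaurant, the output a short description. The attribute names are modelled on the E2E challenge data, but the values and the template itself are illustrative, not an actual E2E baseline system.

```python
# An E2E-style meaning representation: attribute-value pairs describing
# a restaurant (attribute names modelled on the E2E challenge data).
mr = {
    "name": "The Eagle",
    "eatType": "coffee shop",
    "food": "French",
    "area": "riverside",
}

def verbalise(mr):
    # Hand-written template; neural systems learn this mapping instead.
    desc = mr["eatType"]
    if "food" in mr:
        desc = f"{mr['food']} {desc}"
    out = f"{mr['name']} is a {desc}"
    if "area" in mr:
        out += f" in the {mr['area']} area"
    return out + "."

print(verbalise(mr))
# The Eagle is a French coffee shop in the riverside area.
```

Such templates are brittle (each new attribute combination needs a new rule), which is one motivation for the learned approaches this book covers.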
1.1.2 GENERATING TEXT FROM DATA
Data is another common source of input for text production, with two prominent data types, namely table and knowledge-base data.2 For instance, Angeli et al. [2010] show how generation can be applied to sportscasting and weather forecast data [Reiter et al., 2005]. Konstas and Lapata [2012a,b] generate text from flight booking data, Lebret et al. [2016] from Wikipedia, and Wiseman et al. [2017] from basketball games box- and line-score tables. In those cases, the input to generation consists of tables containing records with an arbitrary number of fields (cf. Figure 1.5).
There has also been much work on generating text from knowledge bases. Bontcheva and Wilks [2004] generate patient records from clinical data encoded in the RDF format (Resource Description Framework). Power [2009] generates text from whole knowledge bases encoded in OWL or description logic. And, more recently, Perez-Beltrachini and Lapata [2018] have investigated the generation of sentences and short texts from RDF-encoded DBPedia data.
Whereas in generation from AMRs and dependency-based meaning representations, there is often an almost exact semantic match between input and output, this is not the case in data-to-text generation or generation from dialogue moves. As illustrated by the examples shown in Figure 1.5, there is no direct match between input data and output text. Instead, words must be chosen to lexicalise the input KB symbols, function words must be added, ellipsis and coordination must be used to avoid repetitions, and sometimes, content selection must be carried out to ensure that the output text adequately resembles human produced text.
1.1.3 GENERATING TEXT FROM TEXT
The third main strand of research in NLG is text-to-text generation. While for meaning-representations-to-text and data-to-text generation the most usual communicative goal is to verbalise the input, text-to-text generation can be categorised into three main classes depending on whether the communicative goal is to summarise, simplify, or paraphrase.
Text summarisation has various possible inputs and outputs. The input may consist of multiple documents in the same language [Dang, 2006, Hachey, 2009, Harman and Over, 2004] or in multiple languages [Filatova, 2009, Giannakopoulus et al., 2017, Hovy and Lin, 1997, Kabadjov et al., 2010, Lloret and Palomar, 2011]; a single document [Durrett et al., 2016]; or a single (complex) sentence [Chopra et al., 2016, Graff et al., 2003, Napoles et al., 2012]. The latter task is often also referred to as “Sentence Compression” [Cohn and Lapata, 2008, Filippova and Strube, 2008, Filippova et al., 2015, Knight and Marcu, 2000, Pitler, 2010, Toutanova et al., 2016], while a related task, “Sentence Fusion”, consists of combining two or more sentences with overlapping information content, preserving common information and deleting irrelevant details [Filippova, 2010, McKeown et al., 2010, Thadani and McKeown, 2013]. As for the output produced, research on summarisation has focused on generating either a short abstract [Durrett et al., 2016, Grusky et al., 2018, Sandhaus, 2008], a title [Chopra et al., 2016, Rush et al., 2015], or a set of headlines [Hermann et al., 2015].
Figure 1.5: Data-to-Text example input and output (source: Konstas and Lapata [2012a]).
Text paraphrasing aims to rewrite a text while preserving its meaning [Bannard and Callison-Burch, 2005, Barzilay and McKeown, 2001, Dras, 1999, Mallinson et al., 2017, Wubben et al., 2010], while text simplification targets the production of a text that is easier to understand [Narayan and Gardent, 2014, 2016, Siddharthan et al., 2004, Woodsend and Lapata, 2011, Wubben et al., 2012, Xu et al., 2015b, Zhang and Lapata, 2017, Zhu et al., 2010].
Both paraphrasing and text simplification have been shown to facilitate and/or improve the performance of natural language processing (NLP) tools. The ability to automatically generate paraphrases (alternative phrasings of the same content) has been demonstrated to be useful in several areas of NLP such as question answering, where they can be used to improve query expansion [Riezler et al., 2007]; semantic parsing, where they help bridge the gap between a grammar-generated sentence and its natural counterparts [Berant and Liang, 2014]; machine translation [Kauchak and Barzilay, 2006]; sentence compression [Napoles et al., 2011]; and sentence representation [Wieting et al., 2015], where they help provide additional training or evaluation data. From a linguistic standpoint, the automatic generation of paraphrases is an important task in its own right as it demonstrates the capacity of NLP techniques to simulate human behaviour.
Because shorter sentences are generally better processed by NLP systems, text simplification can be used as a pre-processing step which facilitates and improves the performance of parsers [Chandrasekar and Srinivas, 1997, Jelínek, 2014, McDonald and Nivre, 2011, Tomita, 1985], semantic role labelers [Vickrey and Koller, 2008], and statistical machine translation (SMT) systems [Chandrasekar et al., 1996]. Simplification also has a wide range of potential societal applications as it could be of use for people with reading disabilities [Inui et al., 2003] such as aphasia patients [Carroll et al., 1999], low-literacy readers [Watanabe et al., 2009], language learners [Siddharthan, 2002], and children [De Belder and Moens, 2010].
1.2 ROADMAP
This book falls into three main parts.
Part I sets the stage and introduces the basic notions, motivations, and evolutions underlying text production. It consists of three chapters.
Chapter 1 (this chapter) briefly situates text production with respect to text analysis. It describes the range of input covered by text production, i.e., meaning representations, data, and text. And it introduces the main applications of text-production models which will be the focus of this book, namely, automatic summarisation, paraphrasing, text simplification, and data verbalisation.
Chapter 2 summarises pre-neural approaches to text production, focusing first on text production from data and meaning representations and, second, on text-to-text generation. The chapter describes the main assumptions made for these tasks by pre-neural approaches, setting the stage for the following chapter.
Chapter 3 shows how deep learning introduced a profound change of paradigm for text production, leading to models which rely on very different architectural assumptions than pre-neural approaches and to the use of the encoder-decoder model as a unifying framework for all text-production tasks. It then goes on to present a basic encoder-decoder architecture, the sequence-to-sequence model, and shows how this architecture provides a natural framework both for representing the input and for generating from these input representations.
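The sequence-to-sequence architecture just mentioned can be sketched in a few lines: an encoder RNN reads the input token embeddings into a final hidden state, which initialises a decoder RNN that greedily emits output tokens. The sketch below uses random, untrained weights purely to show the forward pass; the vocabulary size, hidden size, and simple Elman-style recurrence are illustrative choices, not a specific published model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                          # toy vocabulary size, hidden size
E = rng.normal(size=(V, d))           # token embeddings (shared for simplicity)
W_enc = rng.normal(size=(d, d)); U_enc = rng.normal(size=(d, d))
W_dec = rng.normal(size=(d, d)); U_dec = rng.normal(size=(d, d))
W_out = rng.normal(size=(d, V))       # projection to output vocabulary

def encode(tokens):
    # Fold the input sequence into a single hidden-state vector.
    h = np.zeros(d)
    for t in tokens:
        h = np.tanh(E[t] @ W_enc + h @ U_enc)
    return h

def decode(h, max_len=5, bos=0):
    # Greedy decoding: at each step, feed back the previous token and
    # pick the highest-scoring next token. Training would fit the
    # weights with cross-entropy against reference outputs.
    y, out = bos, []
    for _ in range(max_len):
        h = np.tanh(E[y] @ W_dec + h @ U_dec)
        y = int(np.argmax(h @ W_out))
        out.append(y)
    return out

print(decode(encode([3, 5, 7])))      # some token sequence (weights untrained)
```

With untrained weights the output is arbitrary, but the interface (encode to a vector, then generate token by token) is exactly the framework the following chapters refine.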
Part II summarises recent progress on neural approaches to text production, showing how the basic encoder-decoder framework described in Chapter 3 can be improved to better model the characteristics of text production.
While neural language models demonstrate a strong ability to generate fluent, natural-sounding text given a sufficient amount of training data, a closer look at the output of text-production systems reveals several issues regarding text quality which have repeatedly been observed across generation tasks. The output text may contain information not present in the input (weak semantic adequacy) or, conversely, fail to convey all information present in the input (lack of coverage); repetitions (stuttering) may be frequent (diminishing grammaticality and fluency); and rare or unknown input units may be verbalised incorrectly or not at all. Chapter 4 discusses these issues and introduces three neural mechanisms standardly used in text production to address them, namely, attention, copy, and coverage. We show how integrating these additional features into the encoder-decoder framework permits generating better text by improving the decoding part of NLG systems. We also briefly mention alternative methods that have been proposed in the literature.
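Of the three mechanisms just named, attention is the common building block: at each decoding step, the decoder state is compared to every encoder state, the resulting scores are normalised with a softmax, and the weighted sum of encoder states (the context vector) feeds into the next prediction. The numpy sketch below shows a dot-product variant with random states and illustrative shapes; copy mechanisms reuse this same distribution to copy rare input tokens directly into the output.

```python
import numpy as np

rng = np.random.default_rng(1)
H_enc = rng.normal(size=(6, 8))   # 6 input positions, hidden size 8
s_dec = rng.normal(size=8)        # current decoder state

scores = H_enc @ s_dec            # one alignment score per input position
weights = np.exp(scores - scores.max())
weights /= weights.sum()          # softmax: the attention distribution
context = weights @ H_enc         # convex combination of encoder states

# The distribution sums to one, and the context vector lives in the
# same space as the encoder states.
assert np.isclose(weights.sum(), 1.0)
assert context.shape == (8,)
```

Coverage, in turn, accumulates these attention distributions over decoding steps so the model can be penalised for attending to the same input position repeatedly (stuttering) or never (missing content).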
In contrast to Chapter 4, which shows how to improve the decoding part of the encoder-decoder framework, Chapter 5 focuses on encoding and shows how encoders can be modified to better take into account the structure of the input. Indeed, relying on the impressive ability of sequence-to-sequence models to generate text, most of the earlier work on neural approaches to text production systematically encoded the input as a sequence. However, the input to text production often has a non-sequential structure. In particular, knowledge-base fragments are generally viewed as graphs, while the documents which can make up the input to summarisation systems are hierarchical structures consisting of paragraphs, which themselves consist of sentences and words. Chapter 5 starts by outlining the shortcomings arising from modelling these complex structures as sequences. It then goes on to introduce different ways in which the structure of the input can be better modelled. For document structure, we discuss hierarchical long short-term memories (LSTMs), ensemble encoders, and hybrid convolutional sequence-to-sequence document encoders. We then review the use of graph-to-sequence models, graph-based triple encoders, and graph convolutional networks as means to capture the graph structure of, e.g., knowledge-base data and Abstract Meaning Representations (AMRs).
Chapter 6 focuses on ways of guiding the learning process so that constraints stemming from the communication goals are better captured. While the standard encoder-decoder framework assumes learning based on the ground truth, usually using cross-entropy, more recent approaches to text production have started investigating alternative methods such as reinforcement learning and multi-task learning (the latter exploiting signal from other, complementary tasks). In Chapter 6, we review some of these approaches, showing for instance how a simplification system can be learned using deep reinforcement learning with a reward capturing key features of simplified text: whether it is fluent (perplexity), whether it differs from the source (the SARI metric), and whether it is similar to the reference (BLEU).
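The shape of such a composite reward can be sketched as a weighted sum of the three component scores. The scorers and mixing weights below are stand-ins: a real system would plug in a language model for fluency (perplexity), SARI for dissimilarity from the source, and BLEU for similarity to the reference, and would tune the weights.

```python
# Composite RL reward for simplification: a weighted combination of
# fluency, source-dissimilarity, and reference-similarity scores.
# The weights are illustrative, not values from any published system.
def reward(fluency, sari, bleu, weights=(0.3, 0.4, 0.3)):
    w_f, w_s, w_b = weights
    return w_f * fluency + w_s * sari + w_b * bleu

# Assuming each component has already been normalised to [0, 1]:
print(reward(fluency=0.9, sari=0.5, bleu=0.7))
```

During training, this scalar scores each sampled output sentence, and the policy gradient pushes the generator toward outputs with higher reward rather than toward exact reproduction of the reference.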
Finally, Part III reviews some of the most prominent data sets used in neural approaches to text production (Chapter 7) and mentions a number of open challenges (Chapter 8).
1.3 WHAT’S NOT COVERED?
While natural language generation has long been the “parent pauvre” of NLP, with few researchers, small workshops, and relatively few publications, “the effectiveness of neural networks and their impressive ability to generate fluent text”3 has spurred massive interest in the field over the last few years. As a result, research in that domain is progressing at high speed, covering an increasingly large number of topics. In this book, we focus on introducing the basics of neural text production, illustrating its workings with some examples from data-to-text, text-to-text, and meaning-representations-to-text generation. We do not provide an exhaustive description of the state of the art for these applications, however, nor do we cover all areas of text production. In particular, paraphrasing and sentence compression are only briefly mentioned. Generation from videos and images, in particular caption generation, is not discussed,4 nor is the whole field of automated creative writing, including poetry generation and storytelling. Evaluation is only briefly discussed. Finally, novel models and techniques which have recently appeared (e.g., transformer models, contextual embeddings, and generative models for generation) are only briefly discussed in the conclusion.
1.4 OUR NOTATIONS
We represent words, sentences, documents, graphs, word counts, and other types of observations with Roman letters (e.g., x, w, s, d, W, S, and D) and parameters with Greek letters (e.g., α, β, and θ). We use bold uppercase letters to represent matrices (e.g., X, Y, and Z), and bold lowercase letters to represent vectors (e.g., a, b, and c) for both random variables x and parameters θ. We use [a; b] to denote vector concatenation. All other notations are introduced when they are used.
1In this case, the speech act is omitted as it is the same for all inputs, namely, to recommend the restaurant described by the set of attribute-value pairs.
2Other types of input data have also been considered in NLG, such as numerical, graphical, and sensor data. We omit them here as, so far, these have been less often considered in neural NLG.
3 http://karpathy.github.io/2015/05/21/rnn-effectiveness/
4See Bernardi et al. [2016], Gatt and Krahmer [2018] for surveys of automatic description generation from images and of the Vision-Language Interface.