Deep Learning Approaches to Text Production - Shashi Narayan


CHAPTER 2

Pre-Neural Approaches

In this chapter, we briefly review the main architectures that were prevalent in pre-neural text generation. These architectures focus on modelling multiple, interacting factors and differ depending on the NLG task they address. More specifically, three main types of pre-neural NLG architectures can be distinguished depending on whether the task is to generate from data, from meaning representations, or from text.

This chapter sets the stage for the following chapters, which show how early neural approaches usually model text production as a single end-to-end process, independently of the specific NLG task being considered and of the number of factors involved. As we shall see, contrary to pre-neural approaches, which use distinct architectures for different NLG tasks, early neural NLG models mostly consist of two sub-modules: an encoder, which maps the input into a continuous representation, and a decoder, which incrementally generates text based on this continuous representation and on the representation of previously generated words.

2.1 DATA-TO-TEXT GENERATION

Generating text from data is a multiple-choice problem. Consider, for instance, the example input data and the associated text shown in Figure 2.1.

Generating this text involves making the following choices.

Content Selection: deciding which parts of the input data should be realised by the text. For instance, in our example, the pass from “purple6” to “purple3” is not mentioned. More generally, a generation system needs to decide which parts of the input should be realised in the output text and which information can be omitted.

Document Planning: finding an appropriate text structure. In our example, the text structure is simple and consists of a single sentence. When the input data is larger however, document planning requires making decisions regarding the number of sentences to be generated, their respective ordering, and the discourse relations holding between them.

Lexicalisation: determining which words should be chosen to realise the input symbols. For instance, the input symbol “badPass” is lexicalised by the verbal phrase “to make a bad pass”.

Referring Expression Generation: determining which referring expression (pronoun, proper name, definite description, etc.) should be used to describe input entities. In our example, all entities are referred to by identifiers which directly translate into proper names. In more complex settings, however, particularly when the generated text is longer and contains multiple references to the same entity, it is necessary to decide whether to use a pronoun or a definite description. Furthermore, when using a definite description (e.g., “the man”), the content of the description needs to be determined: should it be, e.g., “the man with the hat”, “the tall man with a hat”, or “the president”?


Figure 2.1: A Robocup input and output pair example.

Aggregation: using anaphora, ellipsis, coordination, and generally any convenient syntactic means of avoiding repeating the verbalisation of data fragments which occur multiple times in the input. For instance, the use of the subject relative clause in the example above (“purple3 made a bad pass that was picked off by pink9”) permits generating a more compact text than the alternative “purple3 made a bad pass to pink9. There was a turnover from purple3 to pink9”.

Surface Realisation: deciding which syntactic constructs should be exploited to combine the selected lexical items into a well-formed output text. For instance, in the example above, surface realisation determined the use of two verbs, one in the passive voice, the other one in the active voice, both verbs being connected by a subject relative clause.

Figure 2.2 shows how these different sub-tasks can be modelled in a pipeline architecture. As discussed in Gatt and Krahmer [2018], alternative architectures have been explored reflecting different levels of division between modules. Modular approaches maintain a strict division between each sub-task, implementing the interactions between them using either a pipeline, a revision, or a constrained architecture. Planning-based approaches view language production as a goal-driven task (following a “language as action” paradigm) and provide a unifying framework where all modules are implemented by a given set of actions. Finally, integrated or global approaches cut across task divisions, implementing all modules via a single unifying mechanism (e.g., probabilistic context-free grammar or discriminative classifiers) and jointly optimising the various generation sub-tasks.
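The division of labour in such a pipeline can be made concrete with a toy sketch. The following Python fragment chains illustrative content selection, lexicalisation, aggregation, and surface realisation functions over a Robocup-style input; all rules, templates, and event fields are invented for illustration and do not come from any cited system.

```python
# A minimal, illustrative data-to-text pipeline over a toy Robocup-style event
# stream. Each stage mirrors one sub-task from the text; all rules are invented.

def content_selection(events):
    # Decide which parts of the input should be realised in the output.
    return [e for e in events if e["important"]]

def lexicalise(event):
    # Map input symbols to words/phrases via (invented) templates.
    templates = {"badPass": "{agent} made a bad pass",
                 "turnover": "that was picked off by {patient}"}
    return templates[event["type"]].format(**event)

def aggregate(clauses):
    # Combine clauses into one sentence instead of two separate ones.
    return " ".join(clauses)

def realise(text):
    # Trivial surface realisation: capitalise and punctuate.
    return text[0].upper() + text[1:] + "."

events = [
    {"type": "badPass", "agent": "purple3", "important": True},
    {"type": "turnover", "patient": "pink9", "important": True},
    {"type": "pass", "agent": "purple6", "patient": "purple3", "important": False},
]

selected = content_selection(events)  # the purple6-to-purple3 pass is dropped
sentence = realise(aggregate([lexicalise(e) for e in selected]))
print(sentence)  # Purple3 made a bad pass that was picked off by pink9.
```

The pass from “purple6” is filtered out by content selection, mirroring the omission discussed above.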

Overall, though, distinct generation sub-tasks are implemented separately. For instance, even in global, joint optimisation approaches such as Konstas and Lapata [2012a], different rule sets are defined to implement content selection, text planning, and surface realisation. Similarly, in Angeli et al. [2010], three distinct classifiers are used to implement each of these components.


Figure 2.2: Data-to-Text: A pipeline architecture (source: Johanna Moore).

2.2 MEANING REPRESENTATIONS-TO-TEXT GENERATION

For meaning representation to text (MR-to-text) generation, two main types of approaches can be distinguished depending on the nature of the input, the amount of training data available, and the type of text to be produced: grammar-centric vs. statistical.

Grammar-centric approaches are used when the generation task mainly involves surface realisation, i.e., when the gap between input MR and output text is small and the generation task mainly consists of linearising and inflecting some input structure (usually a graph or an unordered tree) whose nodes are decorated with lemmas or words. In that case, a grammar can be used to define a mapping between input MR and output text. Additional heuristic or statistical modules are usually added to handle the ambiguity induced by the strong non-determinism of natural-language grammars.

Statistical approaches provide a more robust solution than grammar-centric approaches. They use machine learning to handle a wider range of choices, going beyond syntactic ambiguity resolution.

2.2.1 GRAMMAR-CENTRIC APPROACHES

In a grammar-centric approach, a grammar describing the syntax of natural language is used to mediate between meaning and text. The grammar can be handwritten [Gardent and Perez-Beltrachini, 2017] or automatically extracted from parse trees [Gyawali and Gardent, 2014, White, 2006]. Because natural language is highly ambiguous, grammar-centric approaches additionally integrate heuristics or statistical modules for handling non-determinism. Three main types of disambiguation filters have been used: filters that reduce the initial search space (often called hypertaggers), filters that eliminate unlikely intermediate structures, and filters that select the best solution from the set of output solutions (usually referred to as rankers).

Pruning the Initial Search Space. Bottom-up grammar-based approaches first select the set of lexical entries/grammar rules which are relevant given the input. For instance, if the input contains the predicate “book”, the lexical entries for “book” will be selected. Because the grammar and the lexicon of a natural language are highly ambiguous, this first step (lexical selection) yields a very large input space where the number of possible combinations to be considered is the Cartesian product of the number of entries/rules selected for each input token. For instance, if the input consists of 10 tokens and each token selects an average of 5 entries, the number of possible combinations to be explored is 5^10.

To reduce this initial search space, Espinosa et al. [2008] introduced hypertagging, a technique adapted from the supertagging method first proposed for parsing by Bangalore and Joshi [1999]. In essence, a hypertagger is a classifier trained to select, for each input token, the n most probable grammar rules/lexical entries, with n a small number. Gardent and Perez-Beltrachini [2017] present an interesting application of hypertagging which shows that hypertagging can be used to support the generation of well-formed natural language queries from description logic formulae. In that case, the hypertagger is shown not only to reduce the initial search space, but also to make choices that correctly capture the interactions between the various micro-planning operations (lexicalisation, referring expression generation, sentence segmentation, aggregation, and surface realisation).
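The effect of hypertagging on the size of the search space can be illustrated with a toy lexicon. The entries and probabilities below are invented; a real hypertagger is a trained classifier, not a lookup table.

```python
# Hypertagging sketch: for each token, keep only the n most probable entries,
# shrinking the search space from prod(|entries(t)|) to at most n**len(tokens).
from math import prod

# Toy lexicon: each token maps to candidate entries with assumed probabilities.
lexicon = {
    "book":   {"NOUN": 0.5, "VERB": 0.3, "NOUN-compound": 0.2},
    "a":      {"DET": 0.9, "NOUN": 0.1},
    "flight": {"NOUN": 0.8, "VERB": 0.2},
}

def hypertag(lexicon, n):
    # Keep the n most probable entries per token.
    return {tok: sorted(entries, key=entries.get, reverse=True)[:n]
            for tok, entries in lexicon.items()}

full_space = prod(len(e) for e in lexicon.values())    # 3 * 2 * 2 combinations
pruned = hypertag(lexicon, n=1)
pruned_space = prod(len(e) for e in pruned.values())   # 1 * 1 * 1
print(full_space, pruned_space, pruned["book"])  # 12 1 ['NOUN']
```

With 10 tokens and 5 entries each, the same arithmetic gives the 5^10 combinations mentioned above; keeping n = 2 entries per token reduces this to 2^10.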

Other techniques used to filter the initial search space include polarity filtering, a technique which rules out any combination of grammar rules such that the syntactic requirements of verbs are not exactly satisfied [Gardent and Kow, 2007], and hybrid bottom-up, top-down filtering, where the structure of the input is used both top-down—to constrain the selection of applicable rules—and bottom-up, to filter the initial search space associated with local input trees [Narayan and Gardent, 2012].

Filtering out Intermediate Structures. Carroll and Oepen [2005] present subsumption-based local ambiguity factoring, together with a procedure that selectively unpacks the generation forest according to a probability distribution given by a conditional, discriminative classifier, in order to filter out unlikely intermediate structures.

To address the fact that there are n! ways to combine any n modifiers with a single constituent, White [2004] proposes to use a language model to prune the chart of identical edges representing different modifier permutations, e.g., to choose between “fierce black cat” and “black fierce cat”. Similarly, Bangalore and Rambow [2000] assume a single derivation tree that encodes a word lattice (“a {fierce black, black fierce} cat”) and use statistical knowledge to select the best linearisation, while Gardent and Kow [2007] propose a two-step surface realisation algorithm for FB-LTAG (Feature-Based Lexicalised Tree-Adjoining Grammar) where, first, substitution is applied to combine trees together and, second, adjunction is applied to integrate modifiers and long-distance dependencies.
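The language-model pruning idea can be sketched as follows, using hand-assigned bigram scores in place of corpus statistics:

```python
# Choosing among modifier permutations with a toy bigram "language model".
# The scores are invented stand-ins for n-gram statistics from a corpus.
from itertools import permutations

bigram_score = {("fierce", "black"): 0.9, ("black", "fierce"): 0.1,
                ("black", "cat"): 0.8, ("fierce", "cat"): 0.4}

def score(words):
    # Product of bigram scores, with a small back-off for unseen pairs.
    s = 1.0
    for a, b in zip(words, words[1:]):
        s *= bigram_score.get((a, b), 0.01)
    return s

# All n! orderings of the modifiers, each followed by the head noun.
candidates = [list(p) + ["cat"] for p in permutations(["fierce", "black"])]
best = max(candidates, key=score)
print(" ".join(best))  # fierce black cat
```

With n modifiers there are n! candidates, so real systems prune the chart rather than enumerate, but the scoring principle is the same.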

Ranking. Two main approaches have been used to rank the output of grammar-based sentence generators. Early approaches simply apply language model n-gram statistics to rank alternatives [Bangalore and Rambow, 2000, Langkilde, 2000, Langkilde-Geary, 2002]. Discriminative disambiguation models were later proposed which used linguistically motivated features, often additionally using language model scores as an additional feature [Nakanishi et al., 2005, Velldal and Oepen, 2006].

2.2.2 STATISTICAL MR-TO-TEXT GENERATION

Because they permit modelling a wider range of transformations than grammars such as, for instance, aggregation, document planning, and referring expression generation, statistical approaches are generally favoured when the input meaning representation is deeper, i.e., when it abstracts away from surface differences and involves more complex MR-to-text transformations. They are, in particular, prevalent when the input MR is an Abstract Meaning Representation (AMR) or a deep unordered dependency tree.

For AMR-to-text generation, Flanigan et al. [2016] convert input graphs to trees by splitting reentrancies (nodes with multiple parents), and then translate the trees into sentences using a tree-to-string transducer. Song et al. [2017] use a synchronous node replacement grammar to simultaneously parse input AMRs and generate sentences. Pourdamghani et al. [2016] first linearise input graphs by breadth-first traversal, and then use a phrase-based machine translation (MT) system to generate text from the resulting linearised sequences. Castro Ferreira et al. [2017] frame generation as a translation task, comparing two MT approaches (phrase-based and neural) and providing a systematic study of the impact of three AMR pre-processing steps (delexicalisation, compression, and linearisation) applied before the MT phase.
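The breadth-first linearisation step can be sketched on a toy AMR-like graph. The graph below is invented, and real AMRs also carry edge labels, which this sketch drops.

```python
# Breadth-first linearisation of an AMR-like graph (concepts only, no edge
# labels), illustrating the pre-processing step before a phrase-based MT system.
from collections import deque

# Toy graph for "the boy wants to go": "boy" is reentrant (two parents).
graph = {"want-01": ["boy", "go-02"], "go-02": ["boy"], "boy": []}

def bfs_linearise(graph, root):
    order, seen, queue = [], set(), deque([root])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a reentrant node ("boy") is emitted only once
        seen.add(node)
        order.append(node)
        queue.extend(graph[node])
    return order

print(bfs_linearise(graph, "want-01"))  # ['want-01', 'boy', 'go-02']
```

The resulting sequence can then be fed to a sequence-to-sequence or phrase-based MT system as if it were a source-language sentence.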

For deep dependency trees, Bohnet et al. [2010] use a cascade of support vector machine (SVM) classifiers whereby an initial classifier decodes semantic input into the corresponding syntactic features, while two subsequent classifiers first linearise the syntax and then perform morphological realisation to inflect the input lemmas.

Statistical approaches have also been used to generate from shallow dependency trees. For instance, Filippova and Strube [2007, 2009] propose a two-step linearisation approach using maximum entropy classifiers, first determining which constituent should occupy sentence-initial position, then ordering the constituents in the remainder of the sentence.


Figure 2.3: Simplifying a sentence.

2.3 TEXT-TO-TEXT GENERATION

Just like data- and MR-to-text production usually decompose the generation task into several sub-tasks, pre-neural Text-to-Text generation focuses on modelling several operations and their interactions. Simplification, compression, and paraphrasing are generally viewed as involving all or some of four main operations: sentence splitting, phrase rewriting, phrase reordering, and phrase deletion, while summarisation is generally decomposed into a three-step process involving content selection (selecting the key information in the input), aggregation (grouping together related information), and generalisation (abstracting from the input document to produce a better text).

2.3.1 SENTENCE SIMPLIFICATION AND SENTENCE COMPRESSION

Sentence Simplification. As illustrated in Figure 2.3, sentence simplification maps a sentence to a simpler, more readable text approximating its content. Typically, a simplified text differs from the original version in that it involves simpler words, simpler syntactic constructions, shorter sentences, and fewer modifiers. In pre-neural approaches, simplification has thus often been modelled using four main operations:

splitting a complex sentence into several simpler sentences;

deleting words or phrases;

reordering phrases or constituents;

substituting words/phrases with simpler ones.

As for data-to-text generation, existing approaches vary in whether they capture all or some of these operations and on how these operations are integrated (pipeline vs. joint approach).

Early work on sentence simplification focused mostly on splitting and rewriting, relying on handcrafted rules to capture syntactic simplification, e.g., to split coordinated and subordinated sentences into several simpler clauses or to model active/passive transformations [Bott et al., 2012, Canning, 2002, Chandrasekar and Srinivas, 1997, Siddharthan, 2002, 2011].
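A handcrafted splitting rule of this kind can be sketched as follows. Real systems operate on parse trees rather than surface strings, so the regular expression below is only an illustration of the idea.

```python
# A toy handcrafted rule in the spirit of early rule-based simplification:
# break a non-restrictive relative clause ("X, who ..., ...") into two sentences.
import re

def split_relative(sentence):
    # Match "<head>, who <clause>, <rest>"; leave other sentences untouched.
    m = re.match(r"(.+?), who (.+?), (.+)", sentence)
    if not m:
        return [sentence]
    head, clause, rest = m.groups()
    return [f"{head} {rest}", f"{head} {clause}."]

out = split_relative("John, who was tired, went home.")
print(out)  # ['John went home.', 'John was tired.']
```

A parse-tree-based implementation would additionally handle agreement, pronominalisation of the repeated subject, and clause types other than “who” relatives.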


Figure 2.4: A Sentence/Compression pair.

A large strand of work has focused on developing machine learning approaches to sentence simplification, using the parallel data set formed by Simple English Wikipedia (SWKP)1 and traditional English Wikipedia (EWKP)2. Zhu et al. [2010] learn a simplification model inspired by syntax-based machine translation techniques [Yamada and Knight, 2001], which encodes the probabilities of the four rewriting operations (substitution, reordering, splitting, and deletion) on the parse tree of the input sentence. It is combined with a language model to improve grammaticality, and the decoder translates sentences into simpler ones by greedily selecting the output sentence with the highest probability. Woodsend and Lapata [2011] learn a quasi-synchronous grammar [Smith and Eisner, 2006] describing a loose alignment between parse trees of complex and simple sentences. Following Dras [1999], they then generate all possible rewrites of a source tree and use integer linear programming to select the best simplification. In Coster and Kauchak [2011] and Wubben et al. [2012], simplification is viewed as a monolingual translation task where a complex sentence is the source and a simpler one is the target. To account for deletion, reordering, and substitution, Coster and Kauchak [2011] train a phrase-based machine translation system on the PWKP corpus while modifying the word alignment output by GIZA++ in Moses to allow for empty alignments. Similarly, Wubben et al. [2012] train a phrase-based machine translation system augmented with a post hoc reranking procedure designed to rank the outputs based on their dissimilarity from the source. Finally, Narayan and Gardent [2014] present a hybrid approach combining a probabilistic model for sentence splitting and deletion with a statistical machine translation system trained on PWKP for substitution and reordering.

Sentence Compression. Most work on sentence compression is extractive.3 That is, the generated compressions are composed of sub-sequences of the input sentence. As a result, work on sentence compression mainly focuses on deletion. However, syntax is often used to optimise the well-formedness of the compressed output. This partially captures syntactic transformations that may result from extractive compression.

Dorr et al. [2003], Jing and McKeown [2000], Zajic et al. [2007] rely on rule-based approaches to determine which words should be kept and which should be deleted. Galanis and Androutsopoulos [2010], Gillick and Favre [2009], Knight and Marcu [2000], Turner and Charniak [2005] developed supervised models trained on parallel data. Finally, Clarke and Lapata [2006], Filippova and Strube [2008], Woodsend and Lapata [2011], Woodsend et al. [2010] present unsupervised approaches which rely on integer linear programming and a set of linguistically motivated constraints to infer globally optimal compressions.
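The ILP-style formulation can be illustrated with a brute-force search over keep/drop decisions. The importance scores, constraints, and length limit below are all invented, and real systems use an ILP solver rather than enumeration, but the objective has the same shape: maximise importance subject to linguistic and length constraints.

```python
# Brute-force stand-in for ILP-based extractive compression: search all
# keep/drop patterns, enforce a toy linguistic constraint (keep the subject
# head and the verb), and maximise an invented importance score under a
# length limit.
from itertools import product

words = ["The", "very", "old", "dog", "barked", "loudly"]
importance = [2, 0, 1, 3, 3, 1]   # invented per-word scores
must_keep = {3, 4}                # indices of subject head and verb
max_len = 3

best, best_score = None, -1
for mask in product([0, 1], repeat=len(words)):
    # A mask is feasible if it respects the length and must-keep constraints.
    if sum(mask) > max_len or not all(mask[i] for i in must_keep):
        continue
    score = sum(s for m, s in zip(mask, importance) if m)
    if score > best_score:
        best, best_score = mask, score

print([w for m, w in zip(best, words) if m])  # ['The', 'dog', 'barked']
```

An ILP solver finds the same global optimum without enumerating all 2^n masks, which is what makes the approach practical for real sentences.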

Sentence Paraphrasing. Unsurprisingly, work on paraphrasing has mainly been concerned with modelling phrase reordering and substitution. Pre-neural approaches include rule-based approaches, drawing on linguistic resources and approaches based on statistical machine translation (SMT).

Thus, McKeown [1983] presents a rule-based method for deriving a paraphrase from the input sentence and its syntactic tree. Bolshakov and Gelbukh [2004] use WordNet and internet statistics of stable word combinations (collocations) to generate paraphrases: words or expressions are substituted with WordNet synonyms only if the latter form valid collocations with the surrounding words, according to statistics gathered from the internet.

Paraphrase generation can be viewed as a monolingual machine translation task. Building on this intuition, Quirk et al. [2004] train a paraphrasing SMT system on large volumes of sentence pairs automatically extracted from clustered news articles available on the World Wide Web. Zhao et al. [2008] use SMT and multiple resources to generate paraphrases. Specifically, a phrasal paraphrase table and a feature function are derived from each resource, which are then combined in a log-linear SMT model for sentence-level paraphrase generation. Similarly, Napoles et al. [2016] provide a black-box machine translation model which uses the PPDB paraphrase database and a statistical machine translation model to generate paraphrases.

Grammar-based models of paraphrases have also been explored. For instance, Narayan et al. [2016] introduce a grammar model for paraphrase generation which is trained on the Paralex corpus, a large monolingual parallel corpus containing 18 million pairs of question paraphrases.

2.3.2 DOCUMENT SUMMARISATION

Intuitively, humans perform summarisation by extracting the key information contained in the input document, aggregating this information, and abstracting from the resulting aggregated text. That is, humans create a summary by abstracting over its content. In practice, however, much of the pre-neural work on automatic text summarisation has focused on extractive summarisation, an approach which simply extracts key sentences from the input document and combines them to form a summary. The difference is illustrated in Figure 2.5. While abstractive summarisation reformulates the selected content, extractive approaches simply stitch together text fragments selected by the extraction step.

There is a vast literature on pre-neural text summarisation. For an overview of this work, we refer the reader to Mani [1999], Nenkova and McKeown [2012], Nenkova et al. [2011].


Figure 2.5: Abstractive vs. extractive summarisation.

Here, we briefly survey some key techniques used in the three main steps usually involved in extractive summarisation, namely:

• creating an intermediate representation of the sentences contained in the input text;

• scoring these sentences based on that representation; and

• creating a summary by selecting the most relevant sentences [Nenkova et al., 2011].

Representing Sentences. Work on extractive summarisation has explored two main ways of representing text: topic representations and indicator representations.

In topic-based approaches, an input document is assigned a topic representation which captures what the text is about. This topic representation is then used to extract from the input those sentences that are strongly topic-related. Several approaches have been proposed to derive these topic representations.

Frequency-based approaches use counts to identify those words that represent a document topic. Luhn [1958] used a frequency threshold to identify frequent content words in a document as descriptive of the document’s topic. Similarly, Conroy et al. [2006] used the log-likelihood ratio to extract those words that have a likelihood statistic greater than what one would expect by chance. Other approaches have used tf.idf ratio and word probability [Goldstein et al., 2000, Luhn, 1958, Nenkova and Vanderwende, 2005].
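A Luhn-style frequency heuristic can be sketched in a few lines. The stopword list and frequency threshold here are assumptions; Luhn's original method also considered word position and dispersion.

```python
# Frequency-based topic words: content words whose count in the document
# exceeds a cutoff are taken as descriptive of the document's topic.
from collections import Counter

doc = ("the cat sat on the mat . the cat chased the dog . "
       "the dog barked at the cat").split()
stopwords = {"the", "on", "at", ".", "a"}  # assumed stopword list

counts = Counter(w for w in doc if w not in stopwords)
topic_words = {w for w, c in counts.items() if c >= 2}  # assumed threshold
print(sorted(topic_words))  # ['cat', 'dog']
```

The tf.idf variant mentioned above replaces the raw count with a weight that discounts words frequent across the whole collection.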

Latent semantic analysis (LSA) [Dumais et al., 1988, Gong and Liu, 2001] and Bayesian topic models [Celikyilmaz and Hakkani-Tur, 2010, Daumé III and Marcu, 2006, Haghighi and Vanderwende, 2009, Wang et al., 2009] exploit word co-occurrences to derive an implicit representation of text topics.


Figure 2.6: A document/summary pair from the CNN/DailyMail data set.

Finally, lexical chains [Barzilay and Elhadad, 1997, Galley and McKeown, 2003, Silber and McCoy, 2002] have been proposed to capture the intuition that topics are expressed not by single words but by sets of related words. For example, the words “asana”, “pranayama”, “breathing”, “body”, “soul” indicate a clear topic, even if each of the words is not by itself very frequent. Based on the lexical relations (synonymy, antonymy, part-whole, and general-specific) contained in WordNet, lexical chains track the prominence of different topics discussed in the input by measuring the occurrence of words that are lexically related to each of these topics.

In contrast to topic-based approaches, indicator approaches do not rely on a single topic representation, but on different text-based indicators such as the position of a sentence in the input document or its similarity with the document title [Kupiec et al., 1995]. Two main types of indicator methods can be distinguished: graph-based and vectorial.

In the graph-based approach [Erkan and Radev, 2004, Mihalcea and Tarau, 2004], an input text is represented as a graph whose vertices represent sentences and where edge labels indicate sentence similarity. Sentences that are related to many other sentences are likely to be central and have high weight for selection in the summary.
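The graph-based idea can be sketched with a simple word-overlap similarity and degree centrality. LexRank proper uses TF-IDF cosine similarity and a PageRank-style stochastic process; the similarity measure and threshold below are simplifying assumptions.

```python
# Graph-based sentence ranking sketch: vertices are sentences, edges connect
# sentences whose (Jaccard word-overlap) similarity exceeds a threshold, and
# sentences with many neighbours are taken as central.
sentences = ["the cat sat on the mat",
             "the cat chased the dog",
             "stocks fell sharply today"]

def overlap(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

n = len(sentences)
degree = [sum(1 for j in range(n)
              if j != i and overlap(sentences[i], sentences[j]) > 0.1)
          for i in range(n)]
central = sentences[degree.index(max(degree))]
print(degree, central)  # [1, 1, 0] the cat sat on the mat
```

The off-topic third sentence gets degree 0 and would be ranked last, which is the intuition behind centrality-based selection.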

Vectorial approaches represent input sentences as feature vectors which can then be exploited by classifiers to determine whether or not a given input sentence should be part of the extracted summary [Hakkani-Tur and Tur, 2007, Leskovec et al., 2005, Lin and Hovy, 2000, Louis et al., 2010, Osborne, 2002, Wong et al., 2008, Zhang et al., 2013, Zhou and Hovy, 2003]. In addition to the topic features that are classically derived by topic-based approaches, common features include the position of the sentence in the document (in news articles, first sentences are almost always informative), position in the paragraph (first and last sentences are often important), sentence length, similarity of the sentence with the document title or headings, weights of the words in a sentence determined by any topic representation approach, presence of named entities or cue phrases from a predetermined list, etc.

Scoring Sentences. Based on whichever text representation has been created, each sentence is then assigned a score indicating its importance. For topic representation approaches, the importance of a sentence is usually computed as the number or the proportion of topic words it contains. For vectorial methods, the weight of each sentence is determined by combining the evidence from the different indicators using machine learning techniques to discover feature weights. In the multi-document summarisation LexRank system, the weight of each sentence is derived by applying stochastic techniques to the graph representation of the text [Erkan and Radev, 2004].
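Topic-based sentence scoring, for instance, reduces to computing the proportion of topic words per sentence (the topic-word set below is assumed given; in practice it comes from a frequency or tf.idf step):

```python
# Score a sentence by the proportion of topic words it contains.
topic_words = {"cat", "dog"}  # assumed output of a topic-representation step

def topic_score(sentence):
    words = sentence.split()
    return sum(w in topic_words for w in words) / len(words)

sents = ["the cat chased the dog", "stocks fell sharply"]
print([round(topic_score(s), 2) for s in sents])  # [0.4, 0.0]
```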

Selecting Summary Sentences. The last step consists of selecting the best combination of important sentences to form a paragraph length summary. The extracted summary should obey three main constraints: It should not exceed a given length, it should contain all relevant information, and it should avoid repetitions.

Most summarisation approaches choose content greedily by incrementally selecting the most informative (highest-scoring) sentences until the length limit is reached. A common strategy for greedily constructing a summary one sentence at a time is maximal marginal relevance (MMR) [Carenini et al., 2007], where, at each step, the algorithm is constrained to select a sentence that is maximally relevant and minimally redundant with sentences already included in the summary. Relevance and novelty are measured separately and then combined using some linear combination to produce a single score determining the importance of a sentence at a given stage of the selection process.
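The MMR strategy can be sketched as follows; the relevance scores, the Jaccard redundancy measure, and the interpolation weight are all illustrative assumptions:

```python
# Maximal marginal relevance (MMR) sketch: at each step, select the candidate
# maximising lam * relevance - (1 - lam) * redundancy with already-selected
# sentences. Relevance scores and lam are invented.
def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def mmr_select(sentences, relevance, k, lam=0.7):
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < k:
        def mmr(s):
            redundancy = max((jaccard(s, t) for t in selected), default=0.0)
            return lam * relevance[s] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = {"the cat sat": 0.9, "the cat sat down": 0.8, "dogs bark loudly": 0.5}
print(mmr_select(list(rel), relevance=rel, k=2))
# ['the cat sat', 'dogs bark loudly']
```

Note that the second-most-relevant sentence is skipped because it is nearly identical to the first selection: the redundancy penalty outweighs its higher relevance, which is exactly the behaviour MMR is designed to produce.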

Sentence selection global optimisation algorithms have also been proposed to jointly maximise informativeness, minimise repetition, and conform to summary length restrictions [Gillick et al., 2009, Riedhammer et al., 2008].

2.4 SUMMARY

In this chapter, we briefly reviewed pre-neural approaches to text production. We saw that these approaches typically decompose the text-production task into several interacting subtasks which vary depending on the specific text-production task being considered. Figure 2.7 illustrates this intuition.

Data-to-text production typically involves text planning (selecting and structuring content) and sentence planning (choosing words, syntactic structures, and means of avoiding repetitions, as well as choosing appropriate referring expressions to describe input entities). Simplification, paraphrasing, and compression involve modelling all or some of four operations, namely phrase rewriting, reordering, and deletion, and sentence splitting. Finally, summarisation can be viewed as encompassing three main modules: content selection (identifying key information), aggregation (structuring key information into a coherent text plan), and generalisation (using linguistic means to generate a natural-sounding, fluent summary).


Figure 2.7: Key modules in pre-neural approaches to text production.

In Chapter 3, we will see that initial neural approaches to text production markedly differ from these pre-neural approaches in that they provide a single, unifying framework, moving away from a decomposition of the text-production task into multiple subtasks. In Chapters 6 and 8, we will then see how more recent work focuses on integrating key ideas from these pre-neural approaches back into the novel deep learning paradigm.

1 SWKP (http://simple.wikipedia.org) is a corpus of simple texts targeting children and adults who are learning English, and whose authors are requested to use easy words and short sentences.

2 http://en.wikipedia.org

3 There are some exceptions. For example, the task of title generation [Filippova and Altun, 2013b] or sentence summarisation [Rush et al., 2015] can be treated as abstractive sentence compression.
