Automatic Text Simplification - Horacio Saggion

CHAPTER 1

Introduction

Automatic text simplification is a research field in computational linguistics that studies methods and techniques to simplify textual content. Text simplification methods should facilitate, or at least speed up, the adaptation of available and future textual material, making information that is accessible to all a reality. Usually (but not necessarily), adapted texts lose some information and have a simplistic style, which is not necessarily a bad thing if the message of the text, which was complicated in the beginning, can in the end be understood by the target reader. Text simplification has also been suggested as a potential pre-processing step for making texts easier to handle by generic text processors such as parsers, or for use in specific information access tasks such as information extraction. Simplifying for human readers is more challenging than simplifying for machine processing, because readers may perceive the output of an automatic system as inadequate at the slightest error.

The interest in automatic text simplification has grown in recent years and in spite of the many approaches and techniques proposed, automatic text simplification is, as of today, far from perfect. The growing interest in text simplification is evidenced by the number of languages which are targeted by researchers worldwide. Simplification systems and simplification studies exist at least for English [Carroll et al., 1998, Chandrasekar et al., 1996, Siddharthan, 2002], Brazilian Portuguese [Aluísio and Gasperin, 2010], Japanese [Inui et al., 2003], French [Seretan, 2012], Italian [Barlacchi and Tonelli, 2013, Dell’Orletta et al., 2011], Basque [Aranzabe et al., 2012], and Spanish [Saggion et al.].

1.1 TEXT SIMPLIFICATION TASKS

Although there are many text characteristics which can be modified in order to make a text more readable or understandable, including the way in which the text is presented, automatic text simplification has usually concentrated on two different tasks—lexical simplification and syntactic simplification—each addressing different sub-problems.

Lexical simplification will attempt either to modify the vocabulary of the text by choosing words which are thought to be more appropriate for the reader (e.g., transforming the sentence “The book was magnificent” into “The book was excellent”) or to include appropriate definitions (e.g., transforming the sentence “The boy had tuberculosis.” into “The boy had tuberculosis, a disease of the lungs.”). Changing words in context is not an easy task because a substitution is very likely to alter the original meaning.
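The two lexical operations just described can be illustrated with a minimal sketch. It assumes, for illustration only, that a simpler word is a more frequent one; the tiny `SYNONYMS`, `FREQUENCY`, and `DEFINITIONS` tables are invented toy resources, not data from any real lexicon or corpus, and the function name `simplify_lexically` is hypothetical.

```python
import re

# Toy resources invented for this sketch; a real system would draw on a
# thesaurus, corpus frequency counts, and a dictionary of definitions.
SYNONYMS = {"magnificent": ["excellent", "splendid"]}
FREQUENCY = {"magnificent": 12, "excellent": 85, "splendid": 9}
DEFINITIONS = {"tuberculosis": "a disease of the lungs"}


def simplify_lexically(sentence):
    """Replace a word by its most frequent synonym, or append a definition."""

    def rewrite(match):
        word = match.group(0)
        key = word.lower()
        # The original word comes first, so it is kept when no synonym
        # has a strictly higher frequency.
        candidates = [key] + SYNONYMS.get(key, [])
        best = max(candidates, key=lambda w: FREQUENCY.get(w, 0))
        if best != key:
            return best.capitalize() if word[0].isupper() else best
        if key in DEFINITIONS:
            return f"{word}, {DEFINITIONS[key]}"
        return word

    return re.sub(r"[A-Za-z]+", rewrite, sentence)


print(simplify_lexically("The book was magnificent"))
print(simplify_lexically("The boy had tuberculosis."))
```

The sketch looks at each word in isolation and therefore ignores context, which is exactly where the difficulty noted above lies: a synonym that suits one sense of a word may distort another.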

Syntactic simplification will try to identify syntactic phenomena in sentences which may hinder readability and comprehension, and to transform the sentence into more readable or understandable equivalents. For example, relative or subordinate clauses or passive constructions, which certain readers may find very difficult to read, could be transformed into simpler sentences or into active form. For instance, the sentence “The festival was held in New Orleans, which was recovering from Hurricane Katrina” could be transformed, without altering the original too much, into “The festival was held in New Orleans. New Orleans was recovering from Hurricane Katrina.”
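The relative clause example can be reproduced with a deliberately naive sketch. It works on the surface string rather than on a parse, and it assumes that the antecedent of “which” is the run of capitalized words immediately before the comma; both are simplifying assumptions made here, and the function name `split_relative_clause` is hypothetical.

```python
def split_relative_clause(sentence):
    """Split 'X ..., which Y' into two sentences, repeating the antecedent.

    Returns the sentence unchanged (as a one-element list) whenever the
    naive pattern does not apply.
    """
    head, separator, clause = sentence.rstrip(". ").partition(", which ")
    if not separator:
        return [sentence]

    # Assumption: the antecedent is the trailing run of capitalized tokens.
    antecedent = []
    for token in reversed(head.split()):
        if token[0].isupper():
            antecedent.insert(0, token)
        else:
            break

    # Give up if no antecedent was found, or if the clause contains a comma
    # and so may be followed by more of the main sentence.
    if not antecedent or "," in clause:
        return [sentence]

    return [head + ".", " ".join(antecedent) + " " + clause + "."]


original = ("The festival was held in New Orleans, "
            "which was recovering from Hurricane Katrina")
print(" ".join(split_relative_clause(original)))
```

A rule-based system would normally locate the clause and its antecedent in a syntactic analysis of the sentence; the string heuristic above fails as soon as the antecedent is a common noun phrase.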

As we shall later see, automatic text simplification is related to other natural language processing tasks such as text summarization and machine translation. The objective of text summarization is to reduce a text to its essential content, which might be useful in simplification on occasions where the text to simplify has too many unnecessary details. The objective of machine translation is to translate a text into a semantically equivalent text in another language. A number of recent automatic text simplification approaches cast text simplification as statistical machine translation; however, this approach to simplification is currently limited by the scarcity of parallel simplification data.

There is an important point to mention here: although lexical and syntactic simplification usually have been addressed separately, they are naturally related. If during syntactic simplification a particular syntactic structure is chosen to replace a complex construction, it also might be necessary to apply transformations at the lexical level to keep the text grammatical. Furthermore, with a text being a coherent and cohesive unit, any change at a local level (words or sentences) will affect textual properties in one way or another (at the local and global level). For example, in some languages, replacing a masculine noun with a feminine synonym during lexical simplification will require repairing local elements such as determiners and adjectives, as well as pronouns or definite expressions in following or preceding sentences. Pragmatic aspects of the text, such as the way in which the original text has been created to communicate a message to specific audiences, are generally ignored by current systems.

As we shall see in this book, most approaches treat text simplification as a sequence of transformations at the word or sentence level, disregarding the global textual content (previous and following text units), thereby affecting important properties such as cohesion and coherence.

1.2 HOW ARE TEXTS SIMPLIFIED?

Various studies have investigated how a given text is transformed into an easier-to-read version. To understand which text transformations would be needed and which could be implemented automatically, Petersen and Ostendorf [2007] analyzed a corpus of 114 pairs of original and abridged CNN news articles in English, distributed by the Literacyworks organization¹ and aimed at adult learners (i.e., native speakers of English with poor reading skills). They first aligned the original and abridged versions of each article, looking for original-version sentences that correspond to sentences in the abridged version. After aligning the corpus, they observed that sentences from the original documents can be dropped (around 30%), aligned to exactly one sentence in the abridged version (47%), or aligned to more than one abridged sentence (19%), the last case corresponding to splits. The one-to-one alignments cover cases where the original sentence is kept practically untouched, cases where only part of the original sentence is kept, and cases of major re-writing operations. A small fraction of the original sentences were also aligned in pairs to a single abridged sentence, accounting for merges. Petersen and Ostendorf’s study also tries to automatically identify sentences in the original document which should be split, since those would be good candidates for simplification. Their approach consists of training a decision-tree learning algorithm (C4.5 [Quinlan, 1993]) to classify a sentence as split or nonsplit. They used various features, including sentence length and several statistics on POS tags and syntactic constructions. Cross-validation experiments show that it is difficult to differentiate between the two classes; moreover, sentence length is the most informative feature and explains much of the classification performance. Another interesting contribution is the study of dropped sentences, for which they train a classifier with some features borrowed from summarization research; however, the classifier is only slightly better than a majority baseline (i.e., never drop).
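To make the notions of a majority baseline and of a single dominant feature concrete, the following sketch compares the two on a handful of sentences. It is not C4.5: as a stand-in, it fits a one-level decision stump on sentence length alone. The lengths and labels are made up for illustration and do not come from the corpus discussed above; the function names are hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def fit_length_stump(lengths, labels):
    """Find the threshold t that maximizes accuracy of 'split if length > t'."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted(set(lengths)):
        correct = sum(
            (length > threshold) == (label == "split")
            for length, label in zip(lengths, labels)
        )
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Invented toy data: sentence lengths in tokens and whether an editor split them.
TOY_LENGTHS = [8, 12, 15, 22, 31, 18, 35, 40]
TOY_LABELS = ["nonsplit"] * 5 + ["split"] * 3

print(majority_baseline(TOY_LABELS))
print(fit_length_stump(TOY_LENGTHS, TOY_LABELS))
```

Both accuracies here are measured on the same sentences used for fitting, so they are optimistic; the study described above relies on cross-validation for this reason.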

In a similar way, Bott and Saggion [2011b] and Drndarevic and Saggion [2012a,b] identified a series of transformations that trained editors apply to produce simplified versions of documents. Their case is notably different from that of Petersen and Ostendorf [2007], given the language (Spanish) and the target population of the simplified text version: people with cognitive disabilities. Bott and Saggion [2011b] analyzed a sample of sentence-aligned original and simplified documents to identify expected simplification operations such as sentence split, sentence deletion, and various types of change operations (syntactic, lexical, etc.). Moreover, additional operations such as insertion and reordering were also documented. Drndarevic and Saggion [2012a,b] concentrate specifically on identifying lexical changes: in addition to synonym substitution, they document cases of numerical expression re-writing (e.g., rounding), named entity reformulation, and insertion of simple definitions. Like Petersen and Ostendorf [2007], Drndarevic and Saggion train a classifier, in their case a Support Vector Machine (SVM) [Joachims, 1998], to identify sentences which could be deleted, improving over a robust baseline that always deletes the last sentence of the document.
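Of the lexical operations listed above, the rounding of numerical expressions is the easiest to sketch. The example below is an illustration under stated assumptions, not the procedure used in the cited studies: it only rewrites integers written with thousands separators (so that years such as 2006 are left alone), keeps two significant digits, and signals the approximation with the word “about”. The function name `round_numbers` is hypothetical.

```python
import re

# Integers with thousands separators, e.g. 48,731 or 1,234,567.
GROUPED_INTEGER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")


def round_numbers(sentence, significant=2):
    """Rewrite large exact integers as rounded approximations."""

    def rewrite(match):
        text = match.group(0)
        value = int(text.replace(",", ""))
        digits = len(str(value))
        if digits <= significant:
            return text
        factor = 10 ** (digits - significant)
        # Integer arithmetic, rounding halves up.
        rounded = (value + factor // 2) // factor * factor
        if rounded == value:
            return text  # already a round number, nothing to simplify
        return f"about {rounded:,}"

    return GROUPED_INTEGER.sub(rewrite, sentence)


print(round_numbers("The festival drew 48,731 visitors."))
```

Decimals, percentages, currencies, and numbers written in words would each need their own treatment, and separator conventions differ between languages.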

1.3 THE NEED FOR TEXT SIMPLIFICATION

The creation of text simplification tools without considering a particular target population could be justifiable in that aspects of text complexity affect a large range of users with reading difficulties. For example, long and syntactically complex sentences are generally hard to process. Some particular sentence constructions, such as those which do not follow the canonical subject-verb-object order (e.g., passive constructions), may be an obstacle for people with aphasia [Devlin and Unthank, 2006] or an autism spectrum disorder (ASD) [Yaneva et al., 2016b]. The same is true for very difficult or specialized vocabulary and infrequent words, which can also prove difficult to understand for people with aphasia [Carroll et al., 1998, Devlin and Unthank, 2006] and ASD [Norbury, 2005]. Moreover, there are also certain aspects of language that prove difficult for specific groups of readers. Language learners, for example, may have a good capacity to infer information, although they may have a very restricted lexicon and may not be able to understand certain grammatical constructions. Dyslexic readers, in turn, do not have a problem with language understanding per se, but with the understanding of the written representation of language. In addition, readers with dyslexia were found to read better when texts use more frequent and shorter words [Rello et al., 2013b]. Finally, people with intellectual disabilities may have problems processing and retaining large amounts of information [Fajardo et al., 2014, Feng et al., 2009].

In order to create adapted versions for specific populations, various initiatives exist which promote accessible texts. An early proposal is Basic English, a language with a reduced vocabulary of just over 800 word forms and a restricted number of grammatical rules. It was conceived after World War I as a tool for international communication or a kind of interlingua [Ogden, 1937]. Other initiatives are Plain English (see “Language for Special Purposes” in Crystal [1987]), for English in the U.S. and U.K., and Rational French, a controlled variety of French designed to make technical documentation more accessible in the context of the aerospace industry [Barthe et al., 1999]. In Europe, there are associations dedicated to the adaptation of text materials (books, leaflets, laws, official documents, etc.) for people with disabilities or low literacy levels, examples of which are the Easy-to-Read Network in Scandinavian countries, the Asociación Lectura Fácil² in Spain, and the Centrum för Lättläst in Sweden.³ These associations usually provide guidance or recommendations about how to prepare or adapt textual material. Some such recommendations are as follows:

• use simple and direct language;

• use one idea per sentence;

• avoid jargon and technical terms;

• avoid abbreviations;

• structure text in a clear and coherent way;

• use one word per concept;

• use personalization; and

• use active voice.

These recommendations, although intuitive, are sometimes difficult to operationalize (for both humans and machines) and sometimes even impossible to follow, especially in the case of adapting an existing piece of text.

1.4 EASY-TO-READ MATERIAL ON THE WEB

Although adapted texts have been produced for many years, nowadays there is a plethora of simplified material on the Web. The Swedish “easy-to-read” newspaper 8 Sidor⁴ is published by the Centrum för Lättläst to allow people access to “easy news.” Other examples of similarly oriented online newspapers and magazines are the Norwegian Klar Tale,⁵ the Belgian l’Essentiel⁶ and Wablieft,⁷ the Danish Radio Ligetil,⁸ the Italian Due Parole,⁹ and the Finnish Selko-Uutiset.¹⁰ For Spanish, the Noticias Fácil website¹¹ provides easy-to-read news for people with disabilities. The Literacyworks website¹² offers CNN news stories in original and abridged (or simplified) formats, which can be used as learning resources for adults with poor reading skills. At the European level, the Inclusion Europe website¹³ provides good examples of how full text simplifications and simplified summaries in various European languages can provide improved access to relevant information. The Simple English Wikipedia¹⁴ provides encyclopedic content which is more accessible than plain Wikipedia articles because of the use of simple language and simple grammatical structures. There are also initiatives which aim to make access to easy-to-read material in particular, and web accessibility in general, a legal right.

The number of websites containing manually simplified material pointed out above clearly indicates a need for simplified texts. However, manual simplification of written documents is very expensive, and manual methods will not be cost-effective, especially if we consider that news is constantly being produced and that simplification would therefore need to keep the same pace. Nevertheless, there is a growing need for methods and techniques to make texts more accessible. For example, people with learning disabilities who need simplified text constitute 5% of the population. However, according to data from the Easy-to-Read Network,¹⁵ if we consider people who cannot read documents with heavy information load or documents from authorities or governmental sources, the percentage of people in need of simplification jumps to 25% of the population.¹⁶ In addition, the need for simplified texts is becoming more important as the incidence of disability increases with the aging of the population.

1.5 STRUCTURE OF THE BOOK

Having briefly introduced what automatic text simplification is and the need for such technology, the rest of the book will cover a number of relevant research methods in the field, which has been the object of scientific inquiry for more than 20 years. Needless to say, many relevant works will not be addressed here; however, we have tried to cover most of the techniques which have been used, or are being used, at the time of writing. In Chapter 2, we will provide an overview of the topic of readability assessment, given its current relevance in many approaches to automatic text simplification. In Chapter 3, we will address techniques which have been proposed for replacing words and phrases by simpler equivalents: the lexical simplification problem. In Chapter 4, we will cover techniques which can be used to simplify the syntactic structure of sentences and phrases, with special emphasis on rule-based, linguistically motivated approaches. Then in Chapter 5, machine learning techniques, optimization, and other statistical techniques to “learn” simplification systems will be described. Chapters 6 and 7 cover closely related topics: in Chapter 6 we will present fully fledged text simplification systems which have specific target populations as users, while in Chapter 7 we will cover sub-systems or methods specifically based on targeted tasks or user characteristics. In Chapter 8, we will cover two important topics: the available datasets for experimentation in text simplification and the current text simplification evaluation techniques. Finally, in Chapter 9, we close with an overview of the field and a critical view of the current state of the art.

¹ http://literacynet.org/

² http://www.lecturafacil.net/

³ http://www.lattlast.se/

⁴ http://8sidor.lattlast.se

⁵ http://www.klartale.no

⁶ http://www.journal-essentiel.be/

⁷ http://www.wablieft.be

⁸ http://www.dr.dk/Nyheder/Ligetil/Presse/Artikler/om.htm

⁹ http://www.dueparole.it

¹⁰ http://papunet.net/selko

¹¹ http://www.noticiasfacil.es

¹² http://www.literacyworks.org/learningresources/

¹³ http://www.inclusion-europe.org

¹⁴ http://simple.wikipedia.org

¹⁵ http://www.easytoread-network.org/

¹⁶ Bror Tronbacke, personal communication, December 2010.
